LKML Archive on lore.kernel.org
* [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection
@ 2022-01-27 17:54 ira.weiny
  2022-01-27 17:54 ` [PATCH V8 01/44] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
                   ` (43 more replies)
  0 siblings, 44 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

NOTES:

I'm sending these to Intel reviewers to get their opinions on this code in a
public forum.  All feedback from V7 has been addressed, along with a number of
additional changes.

Peter Anvin suggested that saving and restoring the MSR in the assembly code
might be better; doing it in C seemed a bit late for his taste.  But the MSR
is saved and restored within the common entry code prior to any general code
being called.  I can't see the general entry code requiring special PKS
access, and it certainly does not for the PMEM use case.

I've also considered changing the names of pt_regs_extended and
pt_regs_auxiliary to something more generic.  Technically these are not
'ptrace registers', but the names seem reasonable since they do extend pt_regs
within the C code, so I left them alone.

Finally, I've split the patches up to be smaller and limited to one change
each.  This helps clarify why each particular change was made, but it also
creates more interdependence between the patches.  The series remains
bisectable, but some patches do nothing useful on their own other than lay
groundwork for the patches which follow.  I hope this is ok.



Changes for V8

Feedback from Thomas
	* clean up noinstr mess
	* Fix static PKEY allocation mess
	* Ensure all functions are consistently named.
	* Split up patches to do 1 thing per patch
	* pkey_update_pkval() implementation
	* Streamline the use of pks_write_pkrs() by not disabling preemption
		- Leave this to the callers who require it.
		- Use documentation and lockdep to prevent errors
	* Clean up commit messages to explain in detail _why_ each patch is
		there.

Feedback from Dave H.
	* Leave out pks_mk_readonly() as it is not used by the PMEM use case

Feedback from Peter Anvin
	* Replace pks_abandon_pkey() with pks_update_exception()
		This is an even greater simplification in that it no longer
		attempts to shield users from faults.  The main use case for
		abandoning a key was to allow a system to continue running
		even after an error.  This should be a rare event, so the
		performance should not be an issue.

* Simplify ARCH_ENABLE_SUPERVISOR_PKEYS

* Update PKS Test code
	- Add default value test
	- Split up the test code into patches which follow each feature
	  addition
	- simplify test code processing
	- ensure consistent reporting of errors.

* Ensure all entry points to the PKS code are protected by
	cpu_feature_enabled(X86_FEATURE_PKS)
	- At the same time make sure non-entry points or sub-functions to the
	  PKS code are not _unnecessarily_ protected by the feature check

* Update documentation
	- Use kernel docs to place the docs with the code for easier internal
	  developer use

* Adjust the PMEM use cases for the core changes

* Split the PMEM patches up to be 1 change per patch and help clarify review

* Review all header files and remove those no longer needed

* Review/update/clarify all commit messages



PKS/PMEM Stray write protection
===============================

This series is broken into 2 parts.

	1) Introduce Protection Key Supervisor (PKS)
	2) Use PKS to protect PMEM from stray writes

Introduce Protection Key Supervisor (PKS)
-----------------------------------------

PKS enables protections on 'domains' of supervisor pages, limiting supervisor
mode access to those pages beyond the normal paging protections.  PKS works in
a similar fashion to the user space pkeys feature (PKU).  As with PKU,
supervisor pkeys are checked in addition to the normal paging protections, and
page mappings are assigned to a pkey domain by setting a 4-bit pkey in the PTE
of that mapping.
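For illustration, the PTE encoding can be modeled in userspace.  The bit
positions below follow the kernel's _PAGE_BIT_PKEY_BIT0 definition (bit 59);
this is a sketch of the encoding only, not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model: on x86-64 the 4-bit protection key occupies PTE
 * bits 62:59 (_PAGE_BIT_PKEY_BIT0 == 59 in the kernel headers). */
#define PTE_PKEY_SHIFT	59
#define PTE_PKEY_MASK	(0xfULL << PTE_PKEY_SHIFT)

/* Assign a mapping to a pkey domain by encoding the key in the PTE */
static uint64_t pte_set_pkey(uint64_t pte, unsigned int pkey)
{
	return (pte & ~PTE_PKEY_MASK) |
	       ((uint64_t)(pkey & 0xf) << PTE_PKEY_SHIFT);
}

static unsigned int pte_get_pkey(uint64_t pte)
{
	return (pte & PTE_PKEY_MASK) >> PTE_PKEY_SHIFT;
}
```

Sixteen domains fit in the 4 bits, matching the 16 keys described below.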

Unlike PKU, permissions are changed via an MSR update.  This avoids TLB
flushes, making it a more efficient way to alter protections than PTE updates.
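The changelog above mentions pkey_update_pkval(); a hedged userspace model of
the 2-bits-per-key masking it performs on the MSR image (the exact kernel
signature is an assumption, the PKR_* constants are the ones this series
defines in asm/pkeys_common.h):

```c
#include <assert.h>
#include <stdint.h>

/* Constants as defined by this series in asm/pkeys_common.h */
#define PKR_AD_BIT		0x1u	/* Access Disable */
#define PKR_WD_BIT		0x2u	/* Write Disable */
#define PKR_BITS_PER_PKEY	2

/* Model of updating one key's 2-bit field within a PKRS/PKRU image;
 * no page table walk or TLB flush is involved. */
static uint32_t pkey_update_pkval(uint32_t pkval, int pkey,
				  uint32_t accessbits)
{
	int shift = pkey * PKR_BITS_PER_PKEY;

	pkval &= ~((PKR_AD_BIT | PKR_WD_BIT) << shift);
	return pkval | (accessbits << shift);
}
```

Changing a domain's permissions is then a single register computation plus one
MSR write, rather than a walk over every PTE in the domain.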

XSAVE is not supported for the PKRS MSR.  Therefore the implementation saves
and restores the MSR in software, across context switches and during
exceptions.  Nested exceptions are supported by giving each exception its own
PKS state.
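The nested-exception handling follows a save/reset-on-entry, restore-on-exit
pattern; a userspace model (struct and field names are assumptions, loosely
mirroring the pt_regs_auxiliary idea from the notes above):

```c
#include <assert.h>
#include <stdint.h>

#define PKS_INIT	0x55555554u	/* example default PKRS image */

/* Stands in for the CPU's PKRS MSR */
static uint32_t current_pkrs = PKS_INIT;

/* Per-exception save area (hypothetical, akin to pt_regs_auxiliary) */
struct aux_pt_regs_model {
	uint32_t pkrs;
};

/* On entry: save the interrupted context's PKRS and reset to the
 * default, so each nesting level starts from a clean PKS state. */
static void exception_enter(struct aux_pt_regs_model *aux)
{
	aux->pkrs = current_pkrs;
	current_pkrs = PKS_INIT;
}

static void exception_exit(struct aux_pt_regs_model *aux)
{
	current_pkrs = aux->pkrs;
}
```

Because each level carries its own save slot, an exception taken while a
thread has opened a pkey does not leak that access into the handler.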

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections because PTEs naturally have a pkey value of 0.

The other keys (1-15) are statically allocated by kernel users.  This is done
by adding an entry to 'enum pks_pkey_consumers' and a corresponding default
value in PKS_INIT_VALUE.
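A sketch of what that static allocation might look like.  Only PKS_INIT_VALUE
and 'enum pks_pkey_consumers' are named by the series; the consumer entries
besides the default key and the PKR_RW_KEY() helper are illustrative:

```c
#include <assert.h>

#define PKR_AD_BIT		0x1u
#define PKR_BITS_PER_PKEY	2
#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
#define PKR_RW_KEY(pkey)	0u	/* read/write: neither AD nor WD */

/* Hypothetical consumer list; real users add an entry here */
enum pks_pkey_consumers {
	PKS_KEY_DEFAULT = 0,		/* reserved, must stay fully open */
	PKS_KEY_TEST = 1,		/* PKS_TEST */
	PKS_KEY_PGMAP_PROTECTION = 2,	/* PMEM stray write protection */
	PKS_KEY_NR_CONSUMERS,
};

/* Default MSR image: key 0 open (preserving normal paging semantics,
 * since PTEs default to pkey 0), allocated keys start access-disabled. */
#define PKS_INIT_VALUE	(PKR_RW_KEY(PKS_KEY_DEFAULT) |		\
			 PKR_AD_KEY(PKS_KEY_TEST) |		\
			 PKR_AD_KEY(PKS_KEY_PGMAP_PROTECTION))
```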

Two users of keys, PKS_TEST and PMEM stray write protection, are included in
this series.  When the number of users grows larger, the sharing of keys will
need to be resolved based on the needs of the users at that time.  Many
methods have been contemplated, but the number of kernel users and use cases
envisioned is still quite small, far fewer than the 15 available keys.

The following are key attributes of PKS.

	1) Fast switching of permissions
		1a) Prevents access without page table manipulations
		1b) No TLB flushes required
	2) Works on a per thread basis, thus allowing protections to be
	   preserved on threads which are not actively accessing data through
	   the mapping.

PKS is available with both 4- and 5-level paging.  For this reason, and for
simplicity of implementation, the feature is restricted to x86_64.


Use PKS to protect PMEM from stray writes
-----------------------------------------

DAX leverages the direct-map to enable 'struct page' services for PMEM.  Given
that PMEM capacity may be an order of magnitude higher than that of System
RAM, it presents a large vulnerability surface to stray writes.  Such a stray
write becomes a silent data corruption bug.

Stray pointers to System RAM may result in a crash or other undesirable
behavior which, while unfortunate, is usually recoverable with a reboot.
Stray writes to PMEM are permanent in nature and thus more likely to result
in permanent user data loss.  Given that PMEM access from the kernel is
limited to a constrained set of locations (the PMEM driver, Filesystem-DAX,
direct-I/O, and any properly kmap'ed page), it is amenable to PKS protection.

Set up an infrastructure for extra device access protection. Then implement the
protection using the new Protection Keys Supervisor (PKS) on architectures
which support it.

Because PMEM pages are all associated with a struct dev_pagemap, and because
flags in struct page are valuable, the flag requesting protection can be
stored in struct dev_pagemap.  All PMEM is protected by the same pkey, so a
single flag is all that is needed to indicate protection.

General access in the kernel is supported by modifying the kmap
infrastructure to detect if a page is PKS protected and enable access until
the corresponding unmap is called.

Because PKS is a thread-local mechanism, and because kmap was never really
intended to create long term mappings, this implementation does not support
the kmap()/kunmap() calls.  Calling kmap() on a PMEM protected page is
flagged with a warning and a trace from that call stack.  Access through such
a mapping may or may not fault depending on whether the access occurs in the
thread which created the mapping.
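Conceptually the kmap hook looks something like the following userspace model.
The *_model names are hypothetical stand-ins; pgmap_mk_readwrite() and
pgmap_mk_noaccess() are the calls this series defines, here reduced to a
counter for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct page plus its dev_pagemap protection flag */
struct page_model {
	bool devmap_protected;
};

/* Stands in for the calling thread's PKS access state */
static int access_enabled;

static void *kmap_local_page_model(struct page_model *page)
{
	/* Only protected devmap pages need the PKS permission change */
	if (page->devmap_protected)
		access_enabled++;	/* pgmap_mk_readwrite() */
	return (void *)page;		/* the direct-map address in reality */
}

static void kunmap_local_model(struct page_model *page)
{
	if (page->devmap_protected)
		access_enabled--;	/* pgmap_mk_noaccess() */
}
```

Since the enable/disable is tied to the thread that mapped the page, access
through a mapping handed to another thread is exactly the case kmap() can no
longer support.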

Originally this series modified many of the kmap call sites to indicate they
were thread local,[1] and an attempt to support kmap()[2] was made.  But now
that kmap_local_page() has been developed[3] and is in more widespread use,
kmap() can safely be left unsupported.

Furthermore, handling invalid access to these pages is configurable via a new
module parameter, memremap.pks_fault_mode.  Two modes are supported.

	'relaxed' (default) -- WARN_ONCE, disable the protection and allow
	                       access

	'strict' -- prevent any unguarded access to a protected dev_pagemap
		    range

The fault handler detects the invalid access and applies the above
configuration: 'relaxed' warns of the condition while allowing the access to
continue, whereas 'strict' oopses the kernel.  This 'safety valve' feature
has already been useful during the development of this series.
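A minimal model of the mode handling.  The parameter name
memremap.pks_fault_mode and the two mode strings come from this series; the
parsing helper, enum values, and return convention are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum pks_fault_modes {
	PKS_MODE_STRICT,
	PKS_MODE_RELAXED,
};

/* 'relaxed' is the documented default */
static enum pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;

/* Hypothetical module-parameter setter for memremap.pks_fault_mode */
static int param_set_pks_fault_mode(const char *val)
{
	if (!strcmp(val, "relaxed"))
		pks_fault_mode = PKS_MODE_RELAXED;
	else if (!strcmp(val, "strict"))
		pks_fault_mode = PKS_MODE_STRICT;
	else
		return -1;	/* -EINVAL in the kernel */
	return 0;
}

/* Fault-handler decision: true => WARN and allow the access to continue */
static bool pks_fault_allows_access(void)
{
	return pks_fault_mode == PKS_MODE_RELAXED;
}
```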

Due to the nesting nature of kmap and the PMEM direct accesses, in addition
to the fact that the pkey is a single global domain, reference counting must
be employed to ensure that access remains enabled on a thread which may be
nesting accesses and/or creating access to multiple PMEM pages at a time.
The reference count is stored in struct thread_struct.

Reference counting is not needed during exceptions as normal PMEM accesses are
never done during exceptions.
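The nesting behavior of that reference count can be modeled as follows (the
field names are illustrative; the kernel stores the count in struct
thread_struct):

```c
#include <assert.h>

/* Stand-in for the relevant bits of struct thread_struct */
struct thread_model {
	int pks_refcnt;		/* outstanding PMEM access enables */
	int access_on;		/* stands in for the thread's PKRS state */
};

static void pgmap_mk_readwrite(struct thread_model *t)
{
	/* First user on this thread: write PKRS to allow access */
	if (!t->pks_refcnt++)
		t->access_on = 1;
}

static void pgmap_mk_noaccess(struct thread_model *t)
{
	/* Last user gone: restore the no-access default */
	if (!--t->pks_refcnt)
		t->access_on = 0;
}
```

Nested enables keep the single global PMEM pkey open until the outermost
access is dropped, which is why a plain boolean would not suffice.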



[1] https://lore.kernel.org/lkml/20201009195033.3208459-1-ira.weiny@intel.com/

[2] https://lore.kernel.org/lkml/87mtycqcjf.fsf@nanos.tec.linutronix.de/

[3] https://lore.kernel.org/lkml/20210128061503.1496847-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210210062221.3023586-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210205170030.856723-1-ira.weiny@intel.com/
    https://lore.kernel.org/lkml/20210217024826.3466046-1-ira.weiny@intel.com/

[4] https://lore.kernel.org/lkml/20201106232908.364581-1-ira.weiny@intel.com/

[5] https://lore.kernel.org/lkml/20210322053020.2287058-1-ira.weiny@intel.com/

[6] https://lore.kernel.org/lkml/20210331191405.341999-1-ira.weiny@intel.com/


Fenghua Yu (1):
mm/pkeys: Define PKS page table macros

Ira Weiny (42):
entry: Create an internal irqentry_exit_cond_resched() call
Documentation/protection-keys: Clean up documentation for User Space
pkeys
x86/pkeys: Create pkeys_common.h
x86/pkeys: Add additional PKEY helper macros
x86/fpu: Refactor arch_set_user_pkey_access()
mm/pkeys: Add Kconfig options for PKS
x86/pkeys: Add PKS CPU feature bit
x86/fault: Adjust WARN_ON for PKey fault
x86/pkeys: Enable PKS on cpus which support it
Documentation/pkeys: Add initial PKS documentation
mm/pkeys: Define static PKS key array and default values
mm/pkeys: Add initial PKS Test code
x86/pkeys: Introduce pks_write_pkrs()
x86/pkeys: Preserve the PKS MSR on context switch
mm/pkeys: Introduce pks_mk_readwrite()
mm/pkeys: Introduce pks_mk_noaccess()
x86/fault: Add a PKS test fault hook
mm/pkeys: PKS Testing, add pks_mk_*() tests
mm/pkeys: Add PKS test for context switching
x86/entry: Add auxiliary pt_regs space
entry: Pass pt_regs to irqentry_exit_cond_resched()
entry: Add architecture auxiliary pt_regs save/restore calls
x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
x86/pkeys: Preserve PKRS MSR across exceptions
x86/fault: Print PKS MSR on fault
mm/pkeys: Add PKS exception test
mm/pkeys: Introduce pks_update_exception()
mm/pkeys: Test setting a PKS key in a custom fault callback
mm/pkeys: Add pks_available()
memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
memremap_pages: Introduce pgmap_protection_available()
memremap_pages: Introduce a PGMAP_PROTECTION flag
memremap_pages: Introduce devmap_protected()
memremap_pages: Reserve a PKS PKey for eventual use by PMEM
memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested
memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
memremap_pages: Add memremap.pks_fault_mode
memremap_pages: Add pgmap_protection_flag_invalid()
kmap: Ensure kmap works for devmap pages
dax: Stray access protection for dax_direct_access()
nvdimm/pmem: Enable stray access protection
devdax: Enable stray access protection

Rick Edgecombe (1):
mm/pkeys: Introduce PKS fault callbacks

.../admin-guide/kernel-parameters.txt | 14 +
Documentation/core-api/protection-keys.rst | 135 +++-
arch/x86/Kconfig | 6 +
arch/x86/entry/calling.h | 20 +
arch/x86/entry/common.c | 2 +-
arch/x86/entry/entry_64.S | 22 +
arch/x86/entry/entry_64_compat.S | 6 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/entry-common.h | 15 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable_types.h | 22 +
arch/x86/include/asm/pkeys.h | 2 +
arch/x86/include/asm/pkeys_common.h | 15 +
arch/x86/include/asm/pkru.h | 16 +-
arch/x86/include/asm/pks.h | 51 ++
arch/x86/include/asm/processor.h | 17 +-
arch/x86/include/asm/ptrace.h | 22 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/asm-offsets_64.c | 15 +
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/head_64.S | 6 +
arch/x86/kernel/process_64.c | 3 +
arch/x86/mm/fault.c | 32 +-
arch/x86/mm/pkeys.c | 312 +++++++-
drivers/dax/device.c | 2 +
drivers/dax/super.c | 54 ++
drivers/md/dm-writecache.c | 8 +-
drivers/nvdimm/pmem.c | 52 +-
fs/dax.c | 8 +
fs/fuse/virtio_fs.c | 2 +
include/linux/dax.h | 8 +
include/linux/entry-common.h | 15 +-
include/linux/highmem-internal.h | 5 +
include/linux/memremap.h | 1 +
include/linux/mm.h | 90 +++
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 54 ++
include/linux/pks-keys.h | 64 ++
include/linux/sched.h | 7 +
include/uapi/asm-generic/mman-common.h | 1 +
init/init_task.c | 3 +
kernel/entry/common.c | 44 +-
kernel/sched/core.c | 40 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 692 ++++++++++++++++++
mm/Kconfig | 23 +
mm/memremap.c | 150 ++++
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/test_pks.c | 168 +++++
53 files changed, 2176 insertions(+), 108 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 arch/x86/include/asm/pks.h
create mode 100644 include/linux/pks-keys.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

--
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 01/44] entry: Create an internal irqentry_exit_cond_resched() call
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys ira.weiny
                   ` (42 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The call to irqentry_exit_cond_resched() was not properly being
overridden when called from xen_pv_evtchn_do_upcall().

Define __irqentry_exit_cond_resched() as the static call and place the
override logic in irqentry_exit_cond_resched().

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Because this was found via code inspection and does not actually fix any
observed bug, I've not added a Fixes tag.

But for reference:
Fixes: 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
---
 include/linux/entry-common.h |  5 ++++-
 kernel/entry/common.c        | 23 +++++++++++++--------
 kernel/sched/core.c          | 40 ++++++++++++++++++------------------
 3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..ddaffc983e62 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -455,10 +455,13 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
  * Conditional reschedule with additional sanity checks.
  */
 void irqentry_exit_cond_resched(void);
+
+void __irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DECLARE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
+
 /**
  * irqentry_exit - Handle return from exception that used irqentry_enter()
  * @regs:	Pointer to pt_regs (exception entry regs)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bad713684c2e..490442a48332 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -380,7 +380,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	return ret;
 }
 
-void irqentry_exit_cond_resched(void)
+void __irqentry_exit_cond_resched(void)
 {
 	if (!preempt_count()) {
 		/* Sanity check RCU and thread stack */
@@ -392,9 +392,20 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 #ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
+void irqentry_exit_cond_resched(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+		static_call(__irqentry_exit_cond_resched)();
+#else
+		__irqentry_exit_cond_resched();
+#endif
+	}
+}
+
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
@@ -420,13 +431,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		if (IS_ENABLED(CONFIG_PREEMPTION)) {
-#ifdef CONFIG_PREEMPT_DYNAMIC
-			static_call(irqentry_exit_cond_resched)();
-#else
-			irqentry_exit_cond_resched();
-#endif
-		}
+		irqentry_exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 848eaa0efe0e..7197c33beb39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6562,29 +6562,29 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
  * SC:might_resched
  * SC:preempt_schedule
  * SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
+ * SC:__irqentry_exit_cond_resched
  *
  *
  * NONE:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- RET0
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                 <- __cond_resched
+ *   might_resched                <- RET0
+ *   preempt_schedule             <- NOP
+ *   preempt_schedule_notrace     <- NOP
+ *   __irqentry_exit_cond_resched <- NOP
  *
  * VOLUNTARY:
- *   cond_resched               <- __cond_resched
- *   might_resched              <- __cond_resched
- *   preempt_schedule           <- NOP
- *   preempt_schedule_notrace   <- NOP
- *   irqentry_exit_cond_resched <- NOP
+ *   cond_resched                 <- __cond_resched
+ *   might_resched                <- __cond_resched
+ *   preempt_schedule             <- NOP
+ *   preempt_schedule_notrace     <- NOP
+ *   __irqentry_exit_cond_resched <- NOP
  *
  * FULL:
- *   cond_resched               <- RET0
- *   might_resched              <- RET0
- *   preempt_schedule           <- preempt_schedule
- *   preempt_schedule_notrace   <- preempt_schedule_notrace
- *   irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ *   cond_resched                 <- RET0
+ *   might_resched                <- RET0
+ *   preempt_schedule             <- preempt_schedule
+ *   preempt_schedule_notrace     <- preempt_schedule_notrace
+ *   __irqentry_exit_cond_resched <- __irqentry_exit_cond_resched
  */
 
 enum {
@@ -6620,7 +6620,7 @@ void sched_dynamic_update(int mode)
 	static_call_update(might_resched, __cond_resched);
 	static_call_update(preempt_schedule, __preempt_schedule_func);
 	static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-	static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+	static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 
 	switch (mode) {
 	case preempt_dynamic_none:
@@ -6628,7 +6628,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, (void *)&__static_call_return0);
 		static_call_update(preempt_schedule, NULL);
 		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(__irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: none\n");
 		break;
 
@@ -6637,7 +6637,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, __cond_resched);
 		static_call_update(preempt_schedule, NULL);
 		static_call_update(preempt_schedule_notrace, NULL);
-		static_call_update(irqentry_exit_cond_resched, NULL);
+		static_call_update(__irqentry_exit_cond_resched, NULL);
 		pr_info("Dynamic Preempt: voluntary\n");
 		break;
 
@@ -6646,7 +6646,7 @@ void sched_dynamic_update(int mode)
 		static_call_update(might_resched, (void *)&__static_call_return0);
 		static_call_update(preempt_schedule, __preempt_schedule_func);
 		static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
-		static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+		static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 		pr_info("Dynamic Preempt: full\n");
 		break;
 	}
-- 
2.31.1



* [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
  2022-01-27 17:54 ` [PATCH V8 01/44] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 22:39   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h ira.weiny
                   ` (41 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The documentation for user space pkeys was a bit dated, including things such
as Amazon EC2 and distribution testing information which are irrelevant now.

Update the documentation.  This also streamlines adding the Supervisor
Pkey documentation later on.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/core-api/protection-keys.rst | 43 ++++++++++------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..12331db474aa 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,31 +4,28 @@
 Memory Protection Keys
 ======================
 
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains.  It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key.  Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
+Processor" Server CPUs and later.  And it will be available in future
+non-server Intel parts and future AMD processors.
+
+pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU).  Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
 thread a different set of protections from every other thread.
 
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register.  The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs.  These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register.  The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs.  These permissions are enforced on data
+access only and have no effect on instruction fetches.
 
 Syscalls
 ========
-- 
2.31.1



* [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
  2022-01-27 17:54 ` [PATCH V8 01/44] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
  2022-01-27 17:54 ` [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 22:43   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros ira.weiny
                   ` (40 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines.  Specifically PKS and PKU
each have:

	1. A single control register
	2. The same number of keys
	3. The same number of bits in the register per key
	4. Access and Write disable in the same bit locations

Given the above, share all the macros that synthesize and manipulate
register values between the two features.  Share these defines by moving
them into a new header, change their names to reflect the common use,
and include the header where needed.

Also, while editing the code, remove the use of 'we' from the comments being
touched.

NOTE: checkpatch errors are ignored for init_pkru_value in order to align the
values in the code.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v7:
	Rebased onto latest
---
 arch/x86/include/asm/pkeys_common.h | 11 +++++++++++
 arch/x86/include/asm/pkru.h         | 20 ++++++++------------
 arch/x86/kernel/fpu/xstate.c        | 10 +++++-----
 arch/x86/mm/pkeys.c                 | 14 ++++++--------
 4 files changed, 30 insertions(+), 25 deletions(-)
 create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index 000000000000..08c736669244
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_COMMON_H
+#define _ASM_X86_PKEYS_COMMON_H
+
+#define PKR_AD_BIT 0x1u
+#define PKR_WD_BIT 0x2u
+#define PKR_BITS_PER_PKEY 2
+
+#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index 74f0a2d34ffd..06980dd42946 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -3,10 +3,7 @@
 #define _ASM_X86_PKRU_H
 
 #include <asm/cpufeature.h>
-
-#define PKRU_AD_BIT 0x1u
-#define PKRU_WD_BIT 0x2u
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 extern u32 init_pkru_value;
@@ -18,18 +15,17 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
-	/*
-	 * Access-disable disables writes too so we need to check
-	 * both bits here.
-	 */
-	return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+	/* Access-disable disables writes too so check both bits here. */
+	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
 }
 
 static inline u32 read_pkru(void)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 02b3ddaf4f75..d8ddd306d225 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1089,19 +1089,19 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits we need in PKRU:  */
+	/* Set the bits needed in PKRU:  */
 	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKRU_AD_BIT;
+		new_pkru_bits |= PKR_AD_BIT;
 	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKRU_WD_BIT;
+		new_pkru_bits |= PKR_WD_BIT;
 
 	/* Shift the bits in to the correct place in PKRU for pkey: */
-	pkey_shift = pkey * PKRU_BITS_PER_PKEY;
+	pkey_shift = pkey * PKR_BITS_PER_PKEY;
 	new_pkru_bits <<= pkey_shift;
 
 	/* Get old PKRU and mask off any old bits in place: */
 	old_pkru = read_pkru();
-	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
 
 	/* Write old part along with new part: */
 	write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e44e938885b7..aa7042f272fb 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -110,19 +110,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
 	return vma_pkey(vma);
 }
 
-#define PKRU_AD_KEY(pkey)	(PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
 /*
  * Make the default PKRU value (at execve() time) as restrictive
  * as possible.  This ensures that any threads clone()'d early
  * in the process's lifetime will not accidentally get access
  * to data which is pkey-protected later on.
  */
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
-		      PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
-		      PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
-		      PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
-		      PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
+		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
+		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
+		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
+		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
 
 static ssize_t init_pkru_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
@@ -155,7 +153,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 	 * up immediately if someone attempts to disable access
 	 * or writes to pkey 0.
 	 */
-	if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+	if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-- 
2.31.1



* [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (2 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 22:47   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
                   ` (39 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Avoid open coding shift and mask operations by defining and using helper
macros for PKey operations.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Move ahead of other patches.
	Simplify to only the macros used in the series
---
 arch/x86/include/asm/pkeys_common.h | 5 ++++-
 arch/x86/include/asm/pkru.h         | 8 ++------
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 08c736669244..d02ab5bc3fff 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -6,6 +6,9 @@
 #define PKR_WD_BIT 0x2u
 #define PKR_BITS_PER_PKEY 2
 
-#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+#define PKR_PKEY_SHIFT(pkey)	((pkey) * PKR_BITS_PER_PKEY)
+
+#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
+#define PKR_WD_KEY(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
 
 #endif /*_ASM_X86_PKEYS_COMMON_H */
diff --git a/arch/x86/include/asm/pkru.h b/arch/x86/include/asm/pkru.h
index 06980dd42946..81ddf88ac3c9 100644
--- a/arch/x86/include/asm/pkru.h
+++ b/arch/x86/include/asm/pkru.h
@@ -15,17 +15,13 @@ extern u32 init_pkru_value;
 
 static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
-
-	return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
+	return !(pkru & PKR_AD_KEY(pkey));
 }
 
 static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
 {
-	int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
-
 	/* Access-disable disables writes too so check both bits here. */
-	return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
+	return !(pkru & (PKR_AD_KEY(pkey) | PKR_WD_KEY(pkey)));
 }
 
 static inline u32 read_pkru(void)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (3 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 22:50   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS ira.weiny
                   ` (38 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Both PKU and PKS update their register values in the same way.  They can
therefore share the update code.

Define a helper, pkey_update_pkval(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

pkey_update_pkval() was contributed by Thomas Gleixner.

Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Update for V8:
	Replace the code Peter provided in update_pkey_reg() for
	Thomas' pkey_update_pkval()
		-- https://lore.kernel.org/lkml/20200717085442.GX10769@hirez.programming.kicks-ass.net/
---
 arch/x86/include/asm/pkeys.h |  2 ++
 arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
 arch/x86/mm/pkeys.c          | 16 ++++++++++++++++
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 1d5f14aff5f6..cc4d4f552f9d 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -131,4 +131,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
 	return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
 }
 
+u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits);
+
 #endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d8ddd306d225..00d059db4106 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1071,8 +1071,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 			      unsigned long init_val)
 {
-	u32 old_pkru, new_pkru_bits = 0;
-	int pkey_shift;
+	u32 pkru;
 
 	/*
 	 * This check implies XSAVE support.  OSPKE only gets
@@ -1089,22 +1088,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
 		return -EINVAL;
 
-	/* Set the bits needed in PKRU:  */
-	if (init_val & PKEY_DISABLE_ACCESS)
-		new_pkru_bits |= PKR_AD_BIT;
-	if (init_val & PKEY_DISABLE_WRITE)
-		new_pkru_bits |= PKR_WD_BIT;
-
-	/* Shift the bits in to the correct place in PKRU for pkey: */
-	pkey_shift = pkey * PKR_BITS_PER_PKEY;
-	new_pkru_bits <<= pkey_shift;
-
-	/* Get old PKRU and mask off any old bits in place: */
-	old_pkru = read_pkru();
-	old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
-	/* Write old part along with new part: */
-	write_pkru(old_pkru | new_pkru_bits);
+	pkru = read_pkru();
+	pkru = pkey_update_pkval(pkru, pkey, init_val);
+	write_pkru(pkru);
 
 	return 0;
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index aa7042f272fb..cf12d8bf122b 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -190,3 +190,19 @@ static __init int setup_init_pkru(char *opt)
 	return 1;
 }
 __setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Kernel users use the same flags as user space:
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ */
+u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
+{
+	int shift = pkey * PKR_BITS_PER_PKEY;
+
+	if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
+		accessbits &= PKEY_ACCESS_MASK;
+
+	pkval &= ~(PKEY_ACCESS_MASK << shift);
+	return pkval | accessbits << shift;
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (4 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 22:54   ` Dave Hansen
  2022-01-29  0:06   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit ira.weiny
                   ` (37 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) is a feature used by kernel
code only.  As such, if no kernel users are configured, the PKS code is
unnecessary overhead.

Define a Kconfig structure which allows kernel code to detect PKS
support by an architecture and then subsequently enable that support
within the architecture.

ARCH_HAS_SUPERVISOR_PKEYS indicates to kernel consumers that an
architecture supports pkeys.  PKS users can then select
ARCH_ENABLE_SUPERVISOR_PKEYS to turn on the support within the
architecture.

If ARCH_ENABLE_SUPERVISOR_PKEYS is not selected, architectures avoid the
PKS overhead.

ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first kernel use case
sets it.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this out to a single change patch
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig       | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ebe8fc76949a..a30fe85e27ac 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1867,6 +1867,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
 	select ARCH_USES_HIGH_VMA_FLAGS
 	select ARCH_HAS_PKEYS
+	select ARCH_HAS_SUPERVISOR_PKEYS
 	help
 	  Memory Protection Keys provides a mechanism for enforcing
 	  page-based protections, but without requiring modification of the
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..46f2bb15aa4e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -804,6 +804,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+	bool
+config ARCH_ENABLE_SUPERVISOR_PKEYS
+	bool
 
 config PERCPU_STATS
 	bool "Collect percpu memory statistics"
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (5 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 23:05   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
                   ` (36 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

The CPU indicates support for PKS in bit 31 of the ECX register after a
CPUID instruction with EAX=7 and ECX=0 (leaf 7, subleaf 0).

Add the defines for this bit and the boilerplate disable infrastructure
predicated on the Kconfig option.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this out into its own patch
---
 arch/x86/include/asm/cpufeatures.h       | 1 +
 arch/x86/include/asm/disabled-features.h | 8 +++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6db4e2932b3d..b917605e9915 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -370,6 +370,7 @@
 #define X86_FEATURE_MOVDIR64B		(16*32+28) /* MOVDIR64B instruction */
 #define X86_FEATURE_ENQCMD		(16*32+29) /* ENQCMD and ENQCMDS instructions */
 #define X86_FEATURE_SGX_LC		(16*32+30) /* Software Guard Extensions Launch Control */
+#define X86_FEATURE_PKS			(16*32+31) /* Protection Keys for Supervisor pages */
 
 /* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
 #define X86_FEATURE_OVERFLOW_RECOV	(17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..66fdad8f3941 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_OSPKE		(1<<(X86_FEATURE_OSPKE & 31))
 #endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+# define DISABLE_PKS		0
+#else
+# define DISABLE_PKS		(1<<(X86_FEATURE_PKS & 31))
+#endif
+
 #ifdef CONFIG_X86_5LEVEL
 # define DISABLE_LA57	0
 #else
@@ -85,7 +91,7 @@
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
-			 DISABLE_ENQCMD)
+			 DISABLE_ENQCMD|DISABLE_PKS)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK18	0
 #define DISABLED_MASK19	0
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (6 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 23:10   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it ira.weiny
                   ` (35 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Previously, if a protection key fault occurred, it indicated something
very wrong, because user page mappings are not supposed to be in the
kernel address space.

Now PKey faults may happen on kernel mappings if the feature is enabled.

If PKS is enabled, avoid the warning in the fault path.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/mm/fault.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index d0074c6ed31a..6ed91b632eac 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1148,11 +1148,15 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
 	/*
-	 * Protection keys exceptions only happen on user pages.  We
-	 * have no user pages in the kernel portion of the address
-	 * space, so do not expect them here.
+	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
+	 * when PKS (PKeys Supervisor) is enabled.
+	 *
+	 * However, if PKS is not enabled WARN if this exception is seen
+	 * because there are no user pages in the kernel portion of the address
+	 * space.
 	 */
-	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
+		     (hw_error_code & X86_PF_PK));
 
 #ifdef CONFIG_X86_32
 	/*
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (7 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 23:18   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation ira.weiny
                   ` (34 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
specific, manipulation of permission restrictions on supervisor page
mappings.  It uses the same mechanism of Protection Keys as those on
User mappings but applies that mechanism to supervisor mappings using a
supervisor specific MSR.

Bit 24 of CR4 is used to enable the feature by software.  Define
pks_setup() to be called when PKS is configured.

Initially, pks_setup() initializes the per-cpu MSR with 0 to enable all
access on all pkeys.  asm/pks.h is added as a new file to store new
internal functions and structures such as pks_setup().

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Move setup_pks() into this patch with a default of all access
		for all pkeys.
	From Thomas
		s/setup_pks/pks_setup/
	Update Change log to better reflect exactly what this patch does.
---
 arch/x86/include/asm/msr-index.h            |  1 +
 arch/x86/include/asm/pks.h                  | 15 +++++++++++++++
 arch/x86/include/uapi/asm/processor-flags.h |  2 ++
 arch/x86/kernel/cpu/common.c                |  2 ++
 arch/x86/mm/pkeys.c                         | 16 ++++++++++++++++
 5 files changed, 36 insertions(+)
 create mode 100644 arch/x86/include/asm/pks.h

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3faf0f97edb1..fca56ca646a0 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -786,6 +786,7 @@
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
 
+#define MSR_IA32_PKRS			0x000006E1
 
 #define MSR_TSX_FORCE_ABORT		0x0000010F
 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
new file mode 100644
index 000000000000..8180fc59790b
--- /dev/null
+++ b/arch/x86/include/asm/pks.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKS_H
+#define _ASM_X86_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void pks_setup(void);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_setup(void) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7b8382c11788..83c1abce7d93 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -59,6 +59,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/uv/uv.h>
 #include <asm/sigframe.h>
+#include <asm/pks.h>
 
 #include "cpu.h"
 
@@ -1632,6 +1633,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 
 	x86_init_rdrand(c);
 	setup_pku(c);
+	pks_setup();
 
 	/*
 	 * Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index cf12d8bf122b..02629219e683 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -206,3 +206,19 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
 	pkval &= ~(PKEY_ACCESS_MASK << shift);
 	return pkval | accessbits << shift;
 }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ */
+void pks_setup(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	wrmsrl(MSR_IA32_PKRS, 0);
+	cr4_set_bits(X86_CR4_PKS);
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (8 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-28 23:57   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values ira.weiny
                   ` (33 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Add initial overview and configuration information about PKS.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/core-api/protection-keys.rst | 57 ++++++++++++++++++++--
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 12331db474aa..58670e3ee39e 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -12,6 +12,9 @@ PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
 Processor" Server CPUs and later.  And it will be available in future
 non-server Intel parts and future AMD processors.
 
+Protection Keys for Supervisor pages (PKS) has been documented in the SDM
+since May 2020.
+
 pkeys work by dedicating 4 previously Reserved bits in each page table entry to
 a "protection key", giving 16 possible keys.
 
@@ -22,13 +25,20 @@ and Write Disable) for each of 16 keys.
 Being a CPU register, PKRU is inherently thread-local, potentially giving each
 thread a different set of protections from every other thread.
 
-There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
-register.  The feature is only available in 64-bit mode, even though there is
+For Userspace (PKU), there are two instructions (RDPKRU/WRPKRU) for reading and
+writing to the register.
+
+For Supervisor (PKS), the register (MSR_IA32_PKRS) is accessible only to the
+kernel through rdmsr and wrmsr.
+
+The feature is only available in 64-bit mode, even though there is
 theoretically space in the PAE PTEs.  These permissions are enforced on data
 access only and have no effect on instruction fetches.
 
-Syscalls
-========
+
+
+Syscalls for user space keys
+============================
 
 There are 3 system calls which directly interact with pkeys::
 
@@ -95,3 +105,42 @@ with a read()::
 The kernel will send a SIGSEGV in both cases, but si_code will be set
 to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
 the plain mprotect() permissions are violated.
+
+
+Kernel API for PKS support
+==========================
+
+Overview
+--------
+
+Similar to user space pkeys, supervisor pkeys allow additional protections to
+be defined for supervisor mappings.  Unlike user space pkeys, violations of
+these protections result in a kernel oops.
+
+Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
+Sapphire Rapids (and later) "Scalable Processor" Server CPUs.  It will also be
+available in future non-server Intel parts.
+
+QEMU also has support: https://www.qemu.org/2021/04/30/qemu-6-0-0/
+
+Kconfig
+-------
+Kernel users intending to use PKS support should depend on
+ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_SUPERVISOR_PKEYS to turn on
+this support within the core.
+
+
+MSR details
+-----------
+
+It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
+but still maintains ordering properties similar to WRPKRU.
+
+Older versions of the SDM on PKRS may be wrong with regard to this
+serialization.  The text should be the same as that of WRPKRU.  From the WRPKRU
+text:
+
+	WRPKRU will never execute transiently. Memory accesses
+	affected by PKRU register will not execute (even transiently)
+	until all prior executions of WRPKRU have completed execution
+	and updated the PKRU register.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (9 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-29  0:02   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 12/44] mm/pkeys: Define PKS page table macros ira.weiny
                   ` (32 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Kernel users will need a way to allocate a PKS Pkey for their use.

Introduce pks-keys.h as a place to define enum pks_pkey_consumers and
the macro PKS_INIT_VALUE.  PKS_INIT_VALUE holds the default value for
each key.  Kernel users reserve a key value by adding an entry to the
enum pks_pkey_consumers with a unique value [1-15] and replacing that
value in the PKS_INIT_VALUE macro using the desired default macro;
PKR_RW_KEY(), PKR_WD_KEY(), or PKR_AD_KEY().

Use this value to initialize all CPUs at boot.

pks-keys.h is added as a new header with minimal header dependencies.
This allows the use of PKS_INIT_VALUE within other headers where the
additional includes from pkeys.h caused major conflicts.  The main
conflict was using PKS_INIT_VALUE for INIT_THREAD in asm/processor.h

Add documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Create pks-keys.h to solve header conflicts in subsequent
		patches.
	Remove create_initial_pkrs_value() which did not work
		Replace it with PKS_INIT_VALUE
		Fix up documentation to match
	s/PKR_RW_BIT/PKR_RW_KEY()/
	s/PKRS_INIT_VALUE/PKS_INIT_VALUE
	Split this off of the previous patch
	Update documentation and embed it in the code to help ensure it
	is kept up to date.

Changes for V7
	Create a dynamic pkrs_initial_value in early init code.
	Clean up comments
	Add comment to macro guard
---
 Documentation/core-api/protection-keys.rst |  4 ++
 arch/x86/include/asm/pkeys_common.h        |  1 +
 arch/x86/mm/pkeys.c                        |  2 +-
 include/linux/pkeys.h                      |  2 +
 include/linux/pks-keys.h                   | 59 ++++++++++++++++++++++
 5 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/pks-keys.h

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 58670e3ee39e..af283a1a9aa0 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -129,6 +129,10 @@ Kernel users intending to use PKS support should depend on
 ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_SUPERVISOR_PKEYS to turn on
 this support within the core.
 
+PKS Key Allocation
+------------------
+.. kernel-doc:: include/linux/pks-keys.h
+        :doc: PKS_KEY_ALLOCATION
 
 MSR details
 -----------
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index d02ab5bc3fff..efb101dee3aa 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -8,6 +8,7 @@
 
 #define PKR_PKEY_SHIFT(pkey)	((pkey) * PKR_BITS_PER_PKEY)
 
+#define PKR_RW_KEY(pkey)	(0          << PKR_PKEY_SHIFT(pkey))
 #define PKR_AD_KEY(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
 #define PKR_WD_KEY(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 02629219e683..a5b5b86e97ce 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -217,7 +217,7 @@ void pks_setup(void)
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
-	wrmsrl(MSR_IA32_PKRS, 0);
+	wrmsrl(MSR_IA32_PKRS, PKS_INIT_VALUE);
 	cr4_set_bits(X86_CR4_PKS);
 }
 
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 86be8bf27b41..e9ea8f152915 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -48,4 +48,6 @@ static inline bool arch_pkeys_enabled(void)
 
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+#include <linux/pks-keys.h>
+
 #endif /* _LINUX_PKEYS_H */
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
new file mode 100644
index 000000000000..05fe4a1cf888
--- /dev/null
+++ b/include/linux/pks-keys.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKS_KEYS_H
+#define _LINUX_PKS_KEYS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+#include <asm/pkeys_common.h>
+
+/**
+ * DOC: PKS_KEY_ALLOCATION
+ *
+ * Users reserve a key value by adding an entry to enum pks_pkey_consumers with
+ * a unique value from 1 to 15, then replacing that value in the
+ * PKS_INIT_VALUE macro with the desired default protection: PKR_RW_KEY(),
+ * PKR_WD_KEY(), or PKR_AD_KEY().
+ *
+ * PKS_KEY_DEFAULT must remain key 0, with a default of read/write, to support
+ * non-PKS protected pages.  Unused keys should be set to Access Disabled
+ * (PKR_AD_KEY()).
+ *
+ * For example to configure a key for 'MY_FEATURE' with a default of Write
+ * Disabled.
+ *
+ * .. code-block:: c
+ *
+ *	enum pks_pkey_consumers
+ *	{
+ *		PKS_KEY_DEFAULT         = 0,
+ *		PKS_KEY_MY_FEATURE      = 1,
+ *	}
+ *
+ *	#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		|
+ *				PKR_WD_KEY(PKS_KEY_MY_FEATURE)		|
+ *				PKR_AD_KEY(2)	| PKR_AD_KEY(3)		|
+ *				PKR_AD_KEY(4)	| PKR_AD_KEY(5)		|
+ *				PKR_AD_KEY(6)	| PKR_AD_KEY(7)		|
+ *				PKR_AD_KEY(8)	| PKR_AD_KEY(9)		|
+ *				PKR_AD_KEY(10)	| PKR_AD_KEY(11)	|
+ *				PKR_AD_KEY(12)	| PKR_AD_KEY(13)	|
+ *				PKR_AD_KEY(14)	| PKR_AD_KEY(15))
+ *
+ */
+enum pks_pkey_consumers {
+	PKS_KEY_DEFAULT		= 0, /* Must be 0 for default PTE values */
+};
+
+#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
+			PKR_AD_KEY(1)	| \
+			PKR_AD_KEY(2)	| PKR_AD_KEY(3)		| \
+			PKR_AD_KEY(4)	| PKR_AD_KEY(5)		| \
+			PKR_AD_KEY(6)	| PKR_AD_KEY(7)		| \
+			PKR_AD_KEY(8)	| PKR_AD_KEY(9)		| \
+			PKR_AD_KEY(10)	| PKR_AD_KEY(11)	| \
+			PKR_AD_KEY(12)	| PKR_AD_KEY(13)	| \
+			PKR_AD_KEY(14)	| PKR_AD_KEY(15))
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _LINUX_PKS_KEYS_H */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 12/44] mm/pkeys: Define PKS page table macros
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (10 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code ira.weiny
                   ` (31 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Fenghua Yu <fenghua.yu@intel.com>

Kernel users will need a way to assign their pkey to pages.

Define _PAGE_PKEY() and PAGE_KERNEL_PKEY() to allow users to set a pkey
on a PTE.

Add documentation.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>

---
Changes for V8
	Split out from the 'Add PKS kernel API' patch
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst |  7 +++++++
 arch/x86/include/asm/pgtable_types.h       | 22 ++++++++++++++++++++++
 include/linux/pgtable.h                    |  4 ++++
 3 files changed, 33 insertions(+)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index af283a1a9aa0..794b7dedc544 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -134,6 +134,13 @@ PKS Key Allocation
 .. kernel-doc:: include/linux/pks-keys.h
         :doc: PKS_KEY_ALLOCATION
 
+Adding Pages to a PKey protected domain
+---------------------------------------
+
+.. kernel-doc:: arch/x86/include/asm/pgtable_types.h
+        :doc: PKS_KEY_ASSIGNMENT
+
+
 MSR details
 -----------
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..e1d4535b525e 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,22 @@
 			 _PAGE_PKEY_BIT2 | \
 			 _PAGE_PKEY_BIT3)
 
+/**
+ * DOC: PKS_KEY_ASSIGNMENT
+ *
+ * The following macros are used to set a pkey value in a supervisor PTE.
+ *
+ * .. code-block:: c
+ *
+ *         #define _PAGE_PKEY(pkey)
+ *         #define PAGE_KERNEL_PKEY(pkey)
+ */
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0)
+#else
+#define _PAGE_PKEY(pkey)	(_AT(pteval_t, 0))
+#endif
+
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED)
 #else
@@ -226,6 +242,12 @@ enum page_cache_mode {
 #define PAGE_KERNEL_IO		__pgprot_mask(__PAGE_KERNEL_IO)
 #define PAGE_KERNEL_IO_NOCACHE	__pgprot_mask(__PAGE_KERNEL_IO_NOCACHE)
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define PAGE_KERNEL_PKEY(pkey)	__pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey))
+#else
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 #endif	/* __ASSEMBLY__ */
 
 /*         xwr */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bc8713a76e03..2864066e03ec 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1510,6 +1510,10 @@ static inline bool arch_has_pfn_modify_check(void)
 # define PAGE_KERNEL_EXEC PAGE_KERNEL
 #endif
 
+#ifndef PAGE_KERNEL_PKEY
+#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL
+#endif
+
 /*
  * Page Table Modification bits for pgtbl_mod_mask.
  *
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (11 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 12/44] mm/pkeys: Define PKS page table macros ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-31 19:30   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
                   ` (30 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The core PKS functionality provides an interface for kernel users to
reserve a key and set up a mapping with that key.

Define test code under CONFIG_PKS_TEST which allows the testing of the
enablement of PKS functionality, basic setting of a page with a pkey,
and ensures all defaults are set properly.

Assign a pkey to the test code.  While this test does waste a pkey
value, this should not be a problem while there remains a very limited
number of potential pkey users.  If pkeys are exhausted in the future
the test can be made mutually exclusive with, or share a key with,
another user.

Operation is simple.  A test is requested by echo'ing the number of the
test into the debugfs file.  The result of the last test is reported by
reading the file.

	$ echo 0 > /sys/kernel/debug/x86/run_pks
	$ cat /sys/kernel/debug/x86/run_pks
	PASS

Two initial tests are created.  One to check that the default values
have been properly assigned and a second which purposely causes a fault.

Add documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Ensure that unknown tests are flagged as failures.
	Split out the various tests into their own patches which test
		the functionality as the series goes.
	Move this basic test forward in the series

Changes for V7
	Add testing for pks_abandon_protections()
	Adjust pkrs_init_value
	Adjust for new defines
	Clean up comments
        Adjust test for static allocation of pkeys
        Use lookup_address() instead of follow_pte()
		follow_pte only works on IO and raw PFN mappings, use
		lookup_address() instead.  lookup_address() is
		constrained to architectures which support it.
---
 Documentation/core-api/protection-keys.rst |   8 +
 include/linux/pks-keys.h                   |   3 +-
 lib/Kconfig.debug                          |  12 ++
 lib/Makefile                               |   3 +
 lib/pks/Makefile                           |   3 +
 lib/pks/pks_test.c                         | 214 +++++++++++++++++++++
 6 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 lib/pks/Makefile
 create mode 100644 lib/pks/pks_test.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 794b7dedc544..234122e56a92 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -155,3 +155,11 @@ text:
 	affected by PKRU register will not execute (even transiently)
 	until all prior executions of WRPKRU have completed execution
 	and updated the PKRU register.
+
+Testing
+-------
+
+Example code can be found in lib/pks/pks_test.c
+
+.. kernel-doc:: lib/pks/pks_test.c
+        :doc: PKS_TEST
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index 05fe4a1cf888..69a0be979515 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -42,10 +42,11 @@
  */
 enum pks_pkey_consumers {
 	PKS_KEY_DEFAULT		= 0, /* Must be 0 for default PTE values */
+	PKS_KEY_TEST		= 1,
 };
 
 #define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
-			PKR_AD_KEY(1)	| \
+			PKR_AD_KEY(PKS_KEY_TEST)	| \
 			PKR_AD_KEY(2)	| PKR_AD_KEY(3)		| \
 			PKR_AD_KEY(4)	| PKR_AD_KEY(5)		| \
 			PKR_AD_KEY(6)	| PKR_AD_KEY(7)		| \
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 14b89aa37c5c..5cab2100c133 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2685,6 +2685,18 @@ config HYPERV_TESTING
 	help
 	  Select this option to enable Hyper-V vmbus testing.
 
+config PKS_TEST
+	bool "PKey (S)upervisor testing"
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	select ARCH_ENABLE_SUPERVISOR_PKEYS
+	help
+	  Select this option to enable testing of PKS core software and
+	  hardware.
+
+	  Answer N if you don't know what supervisor keys are.
+
+	  If unsure, say N.
+
 endmenu # "Kernel Testing and Coverage"
 
 source "Documentation/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 300f569c626b..038a93c89714 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -398,3 +398,6 @@ $(obj)/$(TEST_FORTIFY_LOG): $(addprefix $(obj)/, $(TEST_FORTIFY_LOGS)) FORCE
 ifeq ($(CONFIG_FORTIFY_SOURCE),y)
 $(obj)/string.o: $(obj)/$(TEST_FORTIFY_LOG)
 endif
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..9daccba4f7c4
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TEST) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..159576dda47c
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2021 Intel Corporation. All rights reserved.
+ */
+
+/**
+ * DOC: PKS_TEST
+ *
+ * If CONFIG_PKS_TEST is enabled, a debugfs file is created to initiate
+ * in-kernel testing.  Tests can be triggered by:
+ *
+ * $ echo X > /sys/kernel/debug/x86/run_pks
+ *
+ * where X is:
+ *
+ * * 0  Loop through all CPUs, report the msr, and check against the default.
+ * * 9  Set up and fault on a PKS protected page.
+ *
+ * NOTE: 9 will fault on purpose.  Therefore, it requires the option to be
+ * specified 2 times in a row to ensure the intent to run it.
+ *
+ * $ cat /sys/kernel/debug/x86/run_pks
+ *
+ * Will print the result of the last test.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/pkeys.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+#define CHECK_DEFAULTS		0
+#define RUN_CRASH_TEST		9
+
+static struct dentry *pks_test_dentry;
+static bool crash_armed;
+
+static bool last_test_pass;
+
+struct pks_test_ctx {
+	int pkey;
+	char data[64];
+};
+
+static void *alloc_test_page(int pkey)
+{
+	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
+				    GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0,
+				    NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static struct pks_test_ctx *alloc_ctx(u8 pkey)
+{
+	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+	if (!ctx) {
+		pr_err("Failed to allocate memory for test context\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	ctx->pkey = pkey;
+	sprintf(ctx->data, "%s", "DEADBEEF");
+	return ctx;
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+	kfree(ctx);
+}
+
+static void crash_it(void)
+{
+	struct pks_test_ctx *ctx;
+	void *ptr;
+
+	pr_warn("     ***** BEGIN: Unhandled fault test *****\n");
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to allocate context???\n");
+		return;
+	}
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		return;
+	}
+
+	/* This purposely faults */
+	memcpy(ptr, ctx->data, 8);
+
+	/* Should never get here; if we do, the test failed */
+	last_test_pass = false;
+
+	vfree(ptr);
+	free_ctx(ctx);
+}
+
+static void check_pkey_settings(void *data)
+{
+	unsigned long long msr = 0;
+	unsigned int cpu = smp_processor_id();
+
+	rdmsrl(MSR_IA32_PKRS, msr);
+	if (msr != PKS_INIT_VALUE) {
+		pr_err("cpu %d value incorrect : 0x%llx expected 0x%x\n",
+			cpu, msr, PKS_INIT_VALUE);
+		last_test_pass = false;
+	}
+}
+
+static void arm_or_run_crash_test(void)
+{
+	/*
+	 * WARNING: Test "9" will crash.
+	 *
+	 * Arm the test and print a warning.  A second "9" will run the test.
+	 */
+	if (!crash_armed) {
+		pr_warn("CAUTION: The crash test will cause an oops.\n");
+		pr_warn("         Specify 9 a second time to run\n");
+		pr_warn("         Run any other test to clear\n");
+		crash_armed = true;
+		return;
+	}
+
+	crash_it();
+	crash_armed = false;
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[64];
+	unsigned int len;
+
+	len = sprintf(buf, "%s\n", last_test_pass ? "PASS" : "FAIL");
+
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+			      size_t count, loff_t *ppos)
+{
+	int rc;
+	long option;
+	char buf[2];
+
+	if (copy_from_user(buf, user_buf, 1)) {
+		last_test_pass = false;
+		return -EFAULT;
+	}
+	buf[1] = '\0';
+
+	rc = kstrtol(buf, 0, &option);
+	if (rc) {
+		last_test_pass = false;
+		return count;
+	}
+
+	last_test_pass = true;
+
+	switch (option) {
+	case RUN_CRASH_TEST:
+		arm_or_run_crash_test();
+		goto skip_arm_clearing;
+	case CHECK_DEFAULTS:
+		on_each_cpu(check_pkey_settings, NULL, 1);
+		break;
+	default:
+		last_test_pass = false;
+		break;
+	}
+
+	/* Clear arming on any test run */
+	crash_armed = false;
+
+skip_arm_clearing:
+	return count;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+	.read = pks_read_file,
+	.write = pks_write_file,
+	.llseek = default_llseek,
+	.release = pks_release_file,
+};
+
+static int __init pks_test_init(void)
+{
+	if (cpu_feature_enabled(X86_FEATURE_PKS))
+		pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+						      NULL, &fops_init_pks);
+
+	return 0;
+}
+late_initcall(pks_test_init);
+
+static void __exit pks_test_exit(void)
+{
+	debugfs_remove(pks_test_dentry);
+	pr_info("test exit\n");
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (12 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-29  0:12   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
                   ` (29 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Writing to MSRs is inefficient.  Even though the underlying
WRMSR(MSR_IA32_PKRS) is not serializing (see below), unnecessary writes
to the MSR should be avoided.  This is especially true because the value
of the PKS protections rarely changes from the default.

Introduce pks_write_pkrs() which avoids writing the MSR if the pkrs
value has not changed for the CPU.  Do this by utilizing a per-cpu
cache.  Protect the use of the cached value from preemption by
restricting the use of pks_write_pkrs() to non-preemptible context.
Further restrict its use to callers which have checked X86_FEATURE_PKS.

The initial value of the MSR is preserved on INIT.  While unlikely, the
PKS_INIT_VALUE may be 0 someday, which would prevent pks_write_pkrs()
from updating the MSR.  Keep the MSR write in pks_setup() to ensure the
MSR is initialized at least one time.  Then call pks_write_pkrs() to set
up the per-cpu cached value and ensure it is in sync with the MSR.

It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
serializing but still maintains ordering properties similar to WRPKRU.
The current SDM section on PKRS needs updating but should be the same as
that of WRPKRU.  So to quote from the WRPKRU text:

	WRPKRU will never execute transiently. Memory accesses affected
	by PKRU register will not execute (even transiently) until all
	prior executions of WRPKRU have completed execution and updated
	the PKRU register.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	From Thomas
		Remove get/put_cpu_ptr() and make this a 'lower level
		call.  This makes it preemption unsafe but it is called
		mostly where preemption is already disabled.  Add this
		as a predicate of the call and those calls which need to
		can disable preemption.
		Add lockdep assert for preemption
	Ensure MSR gets written even if the PKS_INIT_VALUE is 0.
	Completely re-write the commit message.
	s/write_pkrs/pks_write_pkrs/
	Split this off into a singular patch

Changes for V7
	Create a dynamic pkrs_initial_value in early init code.
	Clean up comments
	Add comment to macro guard
---
 arch/x86/mm/pkeys.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index a5b5b86e97ce..3dce99ef4127 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -209,15 +209,56 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+static DEFINE_PER_CPU(u32, pkrs_cache);
+
+/*
+ * pks_write_pkrs() - Write the pkrs of the current CPU
+ * @new_pkrs: New value to write to the current CPU register
+ *
+ * Optimizes the MSR writes by maintaining a per cpu cache.
+ *
+ * Context: must be called with preemption disabled
+ * Context: must only be called if PKS is enabled
+ *
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU.  Quote from the WRPKRU text:
+ *
+ *     WRPKRU will never execute transiently. Memory accesses
+ *     affected by PKRU register will not execute (even transiently)
+ *     until all prior executions of WRPKRU have completed execution
+ *     and updated the PKRU register.
+ */
+static inline void pks_write_pkrs(u32 new_pkrs)
+{
+	u32 pkrs = __this_cpu_read(pkrs_cache);
+
+	lockdep_assert_preemption_disabled();
+
+	if (pkrs != new_pkrs) {
+		__this_cpu_write(pkrs_cache, new_pkrs);
+		wrmsrl(MSR_IA32_PKRS, new_pkrs);
+	}
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
+ *
+ * Context: must be called with preemption disabled
  */
 void pks_setup(void)
 {
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
+	/*
+	 * If the PKS_INIT_VALUE is 0 then pks_write_pkrs() could fail to
+	 * initialize the MSR.  Do a single write here to ensure the MSR is
+	 * written at least one time.
+	 */
 	wrmsrl(MSR_IA32_PKRS, PKS_INIT_VALUE);
+	pks_write_pkrs(PKS_INIT_VALUE);
 	cr4_set_bits(X86_CR4_PKS);
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (13 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-29  0:22   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite() ira.weiny
                   ` (28 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKS MSR (PKRS) is defined as a per-logical-processor register.  This
isolates memory access by logical CPU.  Unfortunately, the MSR is not
managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
context switch.

Define pks_saved_pkrs in struct thread_struct.  Initialize all tasks,
including the init_task, with the PKS_INIT_VALUE when created.  Restore
the CPU's MSR to the saved task value on schedule in.

pks_write_current() is added to ensure that non-supervisor pkey
configurations compile correctly without pks_saved_pkrs in
thread_struct, and that CPUs without PKS support are ignored.

NOTE: The value of pks_saved_pkrs does not change with this patch.  That
is left for future patches.

Co-developed-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	From Thomas
		Ensure pkrs_write_current() does not suffer the overhead
		of preempt disable.
		Fix setting of initial value
		Remove flawed and broken create_initial_pkrs_value() in
			favor of a much simpler and robust macro default
		Update function names to be consistent.

	s/pkrs_write_current/pks_write_current
		This is a more consistent name
	s/saved_pkrs/pks_saved_pkrs
	s/pkrs_init_value/PKS_INIT_VALUE
	Remove pks_init_task()
		This function was added mainly to avoid the header file
		issue.  Adding pks-keys.h solved that and saves the
		complexity.

Changes for V7
	Move definitions from asm/processor.h to asm/pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Change pks_init_task()/pks_sched_in() to functions
	s/pks_sched_in/pks_write_current to be used more generically
	later in the series
---
 arch/x86/include/asm/pks.h       |  2 ++
 arch/x86/include/asm/processor.h | 17 ++++++++++++++++-
 arch/x86/kernel/process_64.c     |  3 +++
 arch/x86/mm/pkeys.c              | 13 +++++++++++++
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 8180fc59790b..d211bf36492c 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -5,10 +5,12 @@
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
 void pks_setup(void);
+void pks_write_current(void);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_setup(void) { }
+static inline void pks_write_current(void) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2c5f12ae7d04..3530a0e50b4f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_X86_PROCESSOR_H
 #define _ASM_X86_PROCESSOR_H
 
+#include <linux/pks-keys.h>
+
 #include <asm/processor-flags.h>
 
 /* Forward declaration, a strange C thing */
@@ -502,6 +504,12 @@ struct thread_struct {
 	unsigned long		cr2;
 	unsigned long		trap_nr;
 	unsigned long		error_code;
+
+#ifdef	CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* Saved Protection key register for supervisor mappings */
+	u32			pks_saved_pkrs;
+#endif
+
 #ifdef CONFIG_VM86
 	/* Virtual 86 mode info */
 	struct vm86		*vm86;
@@ -769,7 +777,14 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define INIT_THREAD  {					\
+	.pks_saved_pkrs = PKS_INIT_VALUE,		\
+}
+#else
+#define INIT_THREAD  { }
+#endif
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3402edec236c..81fc0b638308 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,6 +59,7 @@
 /* Not included via unistd.h */
 #include <asm/unistd_32_ia32.h>
 #endif
+#include <asm/pks.h>
 
 #include "process.h"
 
@@ -657,6 +658,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in();
 
+	pks_write_current();
+
 	return prev_p;
 }
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 3dce99ef4127..6d94dfc9a219 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -242,6 +242,19 @@ static inline void pks_write_pkrs(u32 new_pkrs)
 	}
 }
 
+/**
+ * pks_write_current() - Write the current thread's saved PKRS value
+ *
+ * Context: must be called with preemption disabled
+ */
+void pks_write_current(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	pks_write_pkrs(current->thread.pks_saved_pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (14 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-31 23:10   ` Edgecombe, Rick P
  2022-02-01 17:40   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 17/44] mm/pkeys: Introduce pks_mk_noaccess() ira.weiny
                   ` (27 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When a user needs valid access to a PKS protected page, they will need
to change the protections for their pkey to Read/Write within a thread
of execution.

Define pks_mk_readwrite() to update the specified Pkey.  Define
pks_update_protection() as a helper to do the heavy lifting and to allow
for subsequent pks_mk_*() calls.  Define PKEY_READ_WRITE rather than use
a magic value of '0' in pks_update_protection().  Finally, ensure
preemption is disabled while calling pks_write_pkrs() in this code.

Add documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Define PKEY_READ_WRITE
	Make the call inline
	Clean up the names
	Use pks_write_pkrs() with preemption disabled
	Split this out from 'Add PKS kernel API'
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst |  9 ++++++-
 arch/x86/mm/pkeys.c                        | 28 ++++++++++++++++++++++
 include/linux/pkeys.h                      | 25 +++++++++++++++++++
 include/uapi/asm-generic/mman-common.h     |  1 +
 4 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 234122e56a92..e4a27b93f3d4 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -141,11 +141,18 @@ Adding Pages to a PKey protected domain
         :doc: PKS_KEY_ASSIGNMENT
 
 
+Changing permissions of individual keys
+---------------------------------------
+
+.. kernel-doc:: include/linux/pkeys.h
+        :identifiers: pks_mk_readwrite
+
 MSR details
 -----------
 
 It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
-but still maintains ordering properties similar to WRPKRU.
+but still maintains ordering properties similar to WRPKRU.  Thus it is safe to
+immediately use a mapping when the pks_mk*() functions return.
 
 Older versions of the SDM on PKRS may be wrong with regard to this
 serialization.  The text should be the same as that of WRPKRU.  From the WRPKRU
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6d94dfc9a219..7c6498fb8f8d 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -10,6 +10,7 @@
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
 #include <asm/mmu_context.h>            /* vma_pkey()                   */
+#include <asm/pks.h>
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -275,4 +276,31 @@ void pks_setup(void)
 	cr4_set_bits(X86_CR4_PKS);
 }
 
+/*
+ * Do not call this directly, see pks_mk*().
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ *     PKEY_DISABLE_ACCESS
+ *     PKEY_DISABLE_WRITE
+ *
+ */
+void pks_update_protection(int pkey, u32 protection)
+{
+	u32 pkrs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	pkrs = current->thread.pks_saved_pkrs;
+	current->thread.pks_saved_pkrs = pkey_update_pkval(pkrs, pkey,
+							   protection);
+	preempt_disable();
+	pks_write_pkrs(current->thread.pks_saved_pkrs);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(pks_update_protection);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index e9ea8f152915..73b554b99123 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -48,6 +48,31 @@ static inline bool arch_pkeys_enabled(void)
 
 #endif /* ! CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
 #include <linux/pks-keys.h>
+#include <linux/types.h>
+
+#include <uapi/asm-generic/mman-common.h>
+
+void pks_update_protection(int pkey, u32 protection);
+
+/**
+ * pks_mk_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey.  This is
+ * not a global update and only affects the current running thread.
+ */
+static inline void pks_mk_readwrite(int pkey)
+{
+	pks_update_protection(pkey, PKEY_READ_WRITE);
+}
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_mk_readwrite(int pkey) {}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1567a3294c3d..3da6ac9e5ded 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
 /* compatibility flags */
 #define MAP_FILE	0
 
+#define PKEY_READ_WRITE		0x0
 #define PKEY_DISABLE_ACCESS	0x1
 #define PKEY_DISABLE_WRITE	0x2
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 17/44] mm/pkeys: Introduce pks_mk_noaccess()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (15 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 18/44] x86/fault: Add a PKS test fault hook ira.weiny
                   ` (26 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

After a valid access to a PKS protected page, users will need to change
the protections back to No Access for their pkey.

Define pks_mk_noaccess() to update the specified pkey.

Add documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Make the call inline
	Split this patch out from 'Add PKS kernel API'
	Include documentation in this patch
---
 Documentation/core-api/protection-keys.rst |  2 +-
 include/linux/pkeys.h                      | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index e4a27b93f3d4..115afc67153f 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -145,7 +145,7 @@ Changing permissions of individual keys
 ---------------------------------------
 
 .. kernel-doc:: include/linux/pkeys.h
-        :identifiers: pks_mk_readwrite
+        :identifiers: pks_mk_readwrite pks_mk_noaccess
 
 MSR details
 -----------
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 73b554b99123..5f4965f5449b 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -57,6 +57,18 @@ static inline bool arch_pkeys_enabled(void)
 
 void pks_update_protection(int pkey, u32 protection);
 
+/**
+ * pks_mk_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey.  This is not a global
+ * update and only affects the current running thread.
+ */
+static inline void pks_mk_noaccess(int pkey)
+{
+	pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+
 /**
  * pks_mk_readwrite() - Make the domain Read/Write
  * @pkey: the pkey for which the access should change.
@@ -71,6 +83,7 @@ static inline void pks_mk_readwrite(int pkey)
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+static inline void pks_mk_noaccess(int pkey) {}
 static inline void pks_mk_readwrite(int pkey) {}
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 18/44] x86/fault: Add a PKS test fault hook
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (16 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 17/44] mm/pkeys: Introduce pks_mk_noaccess() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-31 19:56   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests ira.weiny
                   ` (25 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKS test code is going to purposely create faults when testing
invalid access.  It will need a way to flag those faults as invalid and
keep the kernel running properly.

Create a hook in the fault handler to call back into the test code such
that the test code can track when a test it runs results in a fault.

The hook returns true if the fault was caused by the test code so that
the main handler can consider the fault handled.  The hook is also
responsible for clearing the condition which caused the fault.

Predicate the hook on CONFIG_PKS_TEST.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 arch/x86/include/asm/pks.h | 14 ++++++++++++++
 arch/x86/mm/fault.c        | 30 ++++++++++++++++++++----------
 lib/pks/pks_test.c         | 12 ++++++++++++
 3 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index d211bf36492c..ee9fff5b4b13 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -14,4 +14,18 @@ static inline void pks_write_current(void) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+
+#ifdef CONFIG_PKS_TEST
+
+bool pks_test_callback(void);
+
+#else /* !CONFIG_PKS_TEST */
+
+static inline bool pks_test_callback(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_PKS_TEST */
+
 #endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6ed91b632eac..bef879943260 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
 #include <asm/irq_stack.h>
+#include <asm/pks.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -1147,16 +1148,25 @@ static void
 do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		   unsigned long address)
 {
-	/*
-	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
-	 * when PKS (PKeys Supervisor) is enabled.
-	 *
-	 * However, if PKS is not enabled WARN if this exception is seen
-	 * because there are no user pages in the kernel portion of the address
-	 * space.
-	 */
-	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
-		     (hw_error_code & X86_PF_PK));
+	if (hw_error_code & X86_PF_PK) {
+		/*
+		 * X86_PF_PK (Protection key exceptions) may occur on kernel
+		 * addresses when PKS (PKeys Supervisor) is enabled.
+		 *
+		 * However, if PKS is not enabled WARN if this exception is
+		 * seen because there are no user pages in the kernel portion
+		 * of the address space.
+		 */
+		WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
+
+		/*
+		 * If a protection key exception occurs it could be because a PKS test
+		 * is running.  If so, pks_test_callback() will clear the protection
+		 * mechanism and return true to indicate the fault was handled.
+		 */
+		if (pks_test_callback())
+			return;
+	}
 
 #ifdef CONFIG_X86_32
 	/*
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 159576dda47c..d84ab6e7a09c 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -47,6 +47,18 @@ struct pks_test_ctx {
 	char data[64];
 };
 
+/*
+ * pks_test_callback() is called by the fault handler to indicate it saw a PKey
+ * fault.
+ *
+ * NOTE: The callback is responsible for clearing any condition which would
+ * cause the fault to re-trigger.
+ */
+bool pks_test_callback(void)
+{
+	return false;
+}
+
 static void *alloc_test_page(int pkey)
 {
 	return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (17 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 18/44] x86/fault: Add a PKS test fault hook ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-01 17:45   ` Dave Hansen
  2022-01-27 17:54 ` [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching ira.weiny
                   ` (24 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Create a test which runs both reads and writes under each of the 2
modes a PKS pkey can be set to: no access and read-write.

First fill out pks_test_callback() to track the fault count and make
the test key read-write so that the fault does not trigger again.

Second verify that the pkey was properly set in the PTE.

Then add the test itself which iterates each of the test cases.

	PKS_TEST_NO_ACCESS,	WRITE,	FAULT_EXPECTED
	PKS_TEST_NO_ACCESS,	READ,	FAULT_EXPECTED

	PKS_TEST_RDWR,		WRITE,	NO_FAULT_EXPECTED
	PKS_TEST_RDWR,		READ,	NO_FAULT_EXPECTED

Finally add pks_mk_noaccess() at the end of the test and in the crash
test to ensure that the pkey value is reset to the default at the
appropriate times.

Add documentation.

Operation from user space is simple:

	$ echo 1 > /sys/kernel/debug/x86/run_pks
	$ cat /sys/kernel/debug/x86/run_pks
	PASS

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Remove readonly test, as that patch is not needed for PMEM
	Split this off into a patch which follows the pks_mk_*()
		patches.  Thus allowing for a better view of how the
		test works compared to the functionality added with
		those patches.
	Remove unneeded prints
---
 lib/pks/pks_test.c | 168 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 167 insertions(+), 1 deletion(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index d84ab6e7a09c..fad9b996562a 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -14,6 +14,8 @@
  * where X is:
  *
  * * 0  Loop through all CPUs, report the msr, and check against the default.
+ * * 1  Allocate a single key and check both access modes on a page.
+ * * 8  Loop through all CPUs, report the msr, and check against the default.
  * * 9  Set up and fault on a PKS protected page.
  *
  * NOTE: 9 will fault on purpose.  Therefore, it requires the option to be
@@ -32,15 +34,21 @@
 #include <linux/vmalloc.h>
 #include <linux/pkeys.h>
 
+#include <asm/pks.h>
+
 #define PKS_TEST_MEM_SIZE (PAGE_SIZE)
 
 #define CHECK_DEFAULTS		0
+#define RUN_SINGLE		1
 #define RUN_CRASH_TEST		9
 
 static struct dentry *pks_test_dentry;
 static bool crash_armed;
 
 static bool last_test_pass;
+static int test_armed_key;
+static int fault_cnt;
+static int prev_fault_cnt;
 
 struct pks_test_ctx {
 	int pkey;
@@ -56,7 +64,102 @@ struct pks_test_ctx {
  */
 bool pks_test_callback(void)
 {
-	return false;
+	bool armed = (test_armed_key != 0);
+
+	if (armed) {
+		pks_mk_readwrite(test_armed_key);
+		fault_cnt++;
+	}
+
+	return armed;
+}
+
+static bool fault_caught(void)
+{
+	bool ret = (fault_cnt != prev_fault_cnt);
+
+	prev_fault_cnt = fault_cnt;
+	return ret;
+}
+
+enum pks_access_mode {
+	PKS_TEST_NO_ACCESS,
+	PKS_TEST_RDWR,
+};
+
+#define PKS_WRITE true
+#define PKS_READ false
+#define PKS_FAULT_EXPECTED true
+#define PKS_NO_FAULT_EXPECTED false
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+	switch (mode) {
+	case PKS_TEST_NO_ACCESS:
+		return "No Access";
+	case PKS_TEST_RDWR:
+		return "Read Write";
+	default:
+		pr_err("BUG in test invalid mode\n");
+		break;
+	}
+
+	return "";
+}
+
+struct pks_access_test {
+	enum pks_access_mode mode;
+	bool write;
+	bool fault;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+	{ PKS_TEST_NO_ACCESS,     PKS_WRITE,  PKS_FAULT_EXPECTED },
+	{ PKS_TEST_NO_ACCESS,     PKS_READ,   PKS_FAULT_EXPECTED },
+
+	{ PKS_TEST_RDWR,          PKS_WRITE,  PKS_NO_FAULT_EXPECTED },
+	{ PKS_TEST_RDWR,          PKS_READ,   PKS_NO_FAULT_EXPECTED },
+};
+
+static bool run_access_test(struct pks_test_ctx *ctx,
+			   struct pks_access_test *test,
+			   void *ptr)
+{
+	bool fault;
+
+	switch (test->mode) {
+	case PKS_TEST_NO_ACCESS:
+		pks_mk_noaccess(ctx->pkey);
+		break;
+	case PKS_TEST_RDWR:
+		pks_mk_readwrite(ctx->pkey);
+		break;
+	default:
+		pr_err("BUG in test invalid mode\n");
+		return false;
+	}
+
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	if (test->write)
+		memcpy(ptr, ctx->data, 8);
+	else
+		memcpy(ctx->data, ptr, 8);
+
+	fault = fault_caught();
+
+	WRITE_ONCE(test_armed_key, 0);
+
+	if (test->fault != fault) {
+		pr_err("pkey test FAILED: mode %s; write %s; fault %s != %s\n",
+			get_mode_str(test->mode),
+			test->write ? "TRUE" : "FALSE",
+			test->fault ? "YES" : "NO",
+			fault ? "YES" : "NO");
+		return false;
+	}
+
+	return true;
 }
 
 static void *alloc_test_page(int pkey)
@@ -66,6 +169,48 @@ static void *alloc_test_page(int pkey)
 				    NUMA_NO_NODE, __builtin_return_address(0));
 }
 
+static bool test_ctx(struct pks_test_ctx *ctx)
+{
+	bool rc = true;
+	int i;
+	u8 pkey;
+	void *ptr = NULL;
+	pte_t *ptep = NULL;
+	unsigned int level;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("Failed to vmalloc page???\n");
+		return false;
+	}
+
+	ptep = lookup_address((unsigned long)ptr, &level);
+	if (!ptep) {
+		pr_err("Failed to lookup address???\n");
+		rc = false;
+		goto done;
+	}
+
+	pkey = pte_flags_pkey(ptep->pte);
+	if (pkey != ctx->pkey) {
+		pr_err("invalid pkey found: %u, test_pkey: %u\n",
+			pkey, ctx->pkey);
+		rc = false;
+		goto done;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+		/* sticky fail */
+		if (!run_access_test(ctx, &pkey_test_ary[i], ptr))
+			rc = false;
+	}
+
+done:
+	vfree(ptr);
+
+	return rc;
+}
+
 static struct pks_test_ctx *alloc_ctx(u8 pkey)
 {
 	struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
@@ -85,6 +230,22 @@ static void free_ctx(struct pks_test_ctx *ctx)
 	kfree(ctx);
 }
 
+static bool run_single(void)
+{
+	struct pks_test_ctx *ctx;
+	bool rc;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx))
+		return false;
+
+	rc = test_ctx(ctx);
+	pks_mk_noaccess(ctx->pkey);
+	free_ctx(ctx);
+
+	return rc;
+}
+
 static void crash_it(void)
 {
 	struct pks_test_ctx *ctx;
@@ -104,6 +265,8 @@ static void crash_it(void)
 		return;
 	}
 
+	pks_mk_noaccess(ctx->pkey);
+
 	/* This purposely faults */
 	memcpy(ptr, ctx->data, 8);
 
@@ -185,6 +348,9 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 	case CHECK_DEFAULTS:
 		on_each_cpu(check_pkey_settings, NULL, 1);
 		break;
+	case RUN_SINGLE:
+		last_test_pass = run_single();
+		break;
 	default:
 		last_test_pass = false;
 		break;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (18 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-01 17:43   ` Edgecombe, Rick P
  2022-02-01 17:47   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 21/44] x86/entry: Add auxiliary pt_regs space ira.weiny
                   ` (23 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PKS software must maintain the PKRS value for each thread in the system.
It must then restore this value whenever a thread is scheduled in.

Create a user space test to verify this.  The test pins 2 processes to
the same CPU so that they context switch.  One sets up a known PKS
value for the test pkey and sleeps while the other runs through all the
protections using the same pkey.  The first process is then allowed to
run and it checks that its PKRS value was properly restored.

On the kernel side 2 additional commands are added: one arms a context
and the other checks that context.  The kernel maintains this context
while the debugfs file remains open.  The context is cleaned up when
the fd is closed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this off from the main testing patch
	Remove unneeded prints
---
 lib/pks/pks_test.c                     |  74 +++++++++++
 tools/testing/selftests/x86/Makefile   |   2 +-
 tools/testing/selftests/x86/test_pks.c | 168 +++++++++++++++++++++++++
 3 files changed, 243 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index fad9b996562a..933f1bed4820 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -15,6 +15,8 @@
  *
  * * 0  Loop through all CPUs, report the msr, and check against the default.
 * * 1  Allocate a single key and check both access modes on a page.
+ * * 2  'arm context' for context switch test
+ * * 3  Check the context armed in '2' to ensure the MSR value was preserved
  * * 8  Loop through all CPUs, report the msr, and check against the default.
  * * 9  Set up and fault on a PKS protected page.
  *
@@ -24,6 +26,11 @@
  * $ cat /sys/kernel/debug/x86/run_pks
  *
  * Will print the result of the last test.
+ *
+ * To automate context switch testing a user space program is provided in:
+ *
+ *	.../tools/testing/selftests/x86/test_pks.c
+ *
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -33,6 +40,9 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/pkeys.h>
+#include <uapi/asm-generic/mman-common.h>
+
+#include <asm/pks.h>
 
 #include <asm/pks.h>
 
@@ -40,6 +50,8 @@
 
 #define CHECK_DEFAULTS		0
 #define RUN_SINGLE		1
+#define ARM_CTX_SWITCH		2
+#define CHECK_CTX_SWITCH	3
 #define RUN_CRASH_TEST		9
 
 static struct dentry *pks_test_dentry;
@@ -309,6 +321,55 @@ static void arm_or_run_crash_test(void)
 	crash_armed = false;
 }
 
+static void arm_ctx_switch(struct file *file)
+{
+	struct pks_test_ctx *ctx;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("Failed to allocate a context\n");
+		last_test_pass = false;
+		return;
+	}
+
+	/* Store context for later checks */
+	if (file->private_data) {
+		pr_warn("Context already armed\n");
+		free_ctx(file->private_data);
+	}
+	file->private_data = ctx;
+
+	/* Ensure a known state to test context switch */
+	pks_mk_readwrite(ctx->pkey);
+}
+
+static void check_ctx_switch(struct file *file)
+{
+	struct pks_test_ctx *ctx;
+	unsigned long reg_pkrs;
+	int access;
+
+	last_test_pass = true;
+
+	if (!file->private_data) {
+		pr_err("No Context switch configured\n");
+		last_test_pass = false;
+		return;
+	}
+
+	ctx = file->private_data;
+
+	rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+	access = (reg_pkrs >> PKR_PKEY_SHIFT(ctx->pkey)) &
+		  PKEY_ACCESS_MASK;
+	if (access != 0) {
+		last_test_pass = false;
+		pr_err("Context switch check failed: pkey %d: 0x%x reg: 0x%lx\n",
+			ctx->pkey, access, reg_pkrs);
+	}
+}
+
 static ssize_t pks_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
@@ -351,6 +412,14 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 	case RUN_SINGLE:
 		last_test_pass = run_single();
 		break;
+	case ARM_CTX_SWITCH:
+		/* start of context switch test */
+		arm_ctx_switch(file);
+		break;
+	case CHECK_CTX_SWITCH:
+		/* After context switch MSR should be restored */
+		check_ctx_switch(file);
+		break;
 	default:
 		last_test_pass = false;
 		break;
@@ -365,6 +434,11 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 
 static int pks_release_file(struct inode *inode, struct file *file)
 {
+	struct pks_test_ctx *ctx = file->private_data;
+
+	if (ctx)
+		free_ctx(ctx);
+
 	return 0;
 }
 
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 8a1f62ab3c8e..e08670596c14 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,7 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
 			test_vsyscall mov_ss_trap \
-			syscall_arg_fault fsgsbase_restore sigaltstack
+			syscall_arg_fault fsgsbase_restore sigaltstack test_pks
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..9a24a4a61f28
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2021 Intel Corporation. All rights reserved.
+ *
+ * User space tool to test PKS operations.  Accesses test code through
+ * <debugfs>/x86/run_pks when CONFIG_PKS_TEST is enabled.
+ */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <unistd.h>
+#include <assert.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <string.h>
+#include <stdbool.h>
+
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
+#define RUN_SINGLE		"1"
+#define ARM_CTX_SWITCH		"2"
+#define CHECK_CTX_SWITCH	"3"
+
+void print_help_and_exit(char *argv0)
+{
+	printf("Usage: %s [-h] <cpu>\n", argv0);
+	printf("	--help,-h  This help\n");
+	printf("\n");
+	printf("	Run a context switch test on <cpu> (Default: 0)\n");
+}
+
+int check_context_switch(int cpu)
+{
+	int switch_done[2];
+	int setup_done[2];
+	cpu_set_t cpuset;
+	char result[32];
+	int rc = 0;
+	pid_t pid;
+	int fd;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(cpu, &cpuset);
+	/*
+	 * Ensure the two processes run on the same CPU so that they go through
+	 * a context switch.
+	 */
+	sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+	if (pipe(setup_done)) {
+		printf("ERROR: Failed to create pipe\n");
+		return -1;
+	}
+	if (pipe(switch_done)) {
+		printf("ERROR: Failed to create pipe\n");
+		return -1;
+	}
+
+	pid = fork();
+	if (pid == 0) {
+		char done = 'y';
+
+		fd = open(PKS_TEST_FILE, O_RDWR);
+		if (fd < 0) {
+			printf("ERROR: cannot open %s\n", PKS_TEST_FILE);
+			return -1;
+		}
+
+		cpu = sched_getcpu();
+		printf("Child running on cpu %d...\n", cpu);
+
+		/* Allocate and run test. */
+		write(fd, RUN_SINGLE, 1);
+
+		/* Arm for context switch test */
+		write(fd, ARM_CTX_SWITCH, 1);
+
+		printf("   tell parent to go\n");
+		write(setup_done[1], &done, sizeof(done));
+
+		/* Context switch out... */
+		printf("   Waiting for parent...\n");
+		read(switch_done[0], &done, sizeof(done));
+
+		/* Check msr restored */
+		printf("Checking result\n");
+		write(fd, CHECK_CTX_SWITCH, 1);
+
+		read(fd, result, 10);
+		printf("   #PF, context switch, pkey allocation and free tests: %s\n", result);
+		if (strncmp(result, "PASS", 10)) {
+			rc = -1;
+			done = 'F';
+		}
+
+		/* Signal result */
+		write(setup_done[1], &done, sizeof(done));
+	} else {
+		char done = 'y';
+
+		read(setup_done[0], &done, sizeof(done));
+		cpu = sched_getcpu();
+		printf("Parent running on cpu %d\n", cpu);
+
+		fd = open(PKS_TEST_FILE, O_RDWR);
+		if (fd < 0) {
+			printf("ERROR: cannot open %s\n", PKS_TEST_FILE);
+			return -1;
+		}
+
+		/* run test with the same pkey */
+		write(fd, RUN_SINGLE, 1);
+
+		printf("   Signaling child.\n");
+		write(switch_done[1], &done, sizeof(done));
+
+		/* Wait for result */
+		read(setup_done[0], &done, sizeof(done));
+		if (done == 'F')
+			rc = -1;
+	}
+
+	close(fd);
+
+	return rc;
+}
+
+int main(int argc, char *argv[])
+{
+	int cpu = 0;
+	int rc;
+	int c;
+
+	while (1) {
+		int option_index = 0;
+		static struct option long_options[] = {
+			{"help",	no_argument,	0,	'h' },
+			{0,		0,		0,	0 }
+		};
+
+		c = getopt_long(argc, argv, "h", long_options, &option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'h':
+			print_help_and_exit(argv[0]);
+			break;
+		}
+	}
+
+	if (optind < argc)
+		cpu = strtoul(argv[optind], NULL, 0);
+
+	if (cpu >= sysconf(_SC_NPROCESSORS_ONLN)) {
+		printf("CPU %d is invalid\n", cpu);
+		cpu = sysconf(_SC_NPROCESSORS_ONLN) - 1;
+		printf("   running on max CPU: %d\n", cpu);
+	}
+
+	rc = check_context_switch(cpu);
+
+	return rc;
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 21/44] x86/entry: Add auxiliary pt_regs space
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (19 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 22/44] entry: Pass pt_regs to irqentry_exit_cond_resched() ira.weiny
                   ` (22 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  Therefore, to preserve the MSR
across an exception, the current CPU's MSR value must be saved
somewhere on entry and restored when returning to the previous
context.

Two possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2] However, Andy
came up with a way to hide additional values on the stack which could be
accessed as "extended_pt_regs".[3] This method allows any code which
has a struct pt_regs pointer to access the extra information, while
adding nothing to irqentry_state_t and leaving pt_regs intact for
compatibility with outside tools like BPF.

Prepare the assembly code to add a hidden auxiliary pt_regs space.  To
simplify, the assembly code only adds space on the stack.  The use of
this space is left to the C code which is required to select
ARCH_HAS_PTREGS_AUXILIARY to enable this support.

Each nested exception gets another copy of this auxiliary space allowing
for any number of levels of exception handling.

Initially the space is left empty and results in no code changes because
ARCH_HAS_PTREGS_AUXILIARY is not set.  Subsequent patches adding data to
pt_regs_auxiliary must set ARCH_HAS_PTREGS_AUXILIARY or a build failure
will occur.  The use of ARCH_HAS_PTREGS_AUXILIARY also avoids the
introduction of 2 instructions (addq/subq) on every entry call when the
extra space is not needed.

32bit is specifically excluded.

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in its development.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8:
	Exclude 32bit
	Introduce ARCH_HAS_PTREGS_AUXILIARY to optimize this away when
		not needed.
	From Thomas
		s/EXTENDED_PT_REGS_SIZE/PT_REGS_AUX_SIZE
		Fix up PTREGS_AUX_SIZE macro to be based on the
			structures and used in assembly code via the
			nifty asm-offset macros
		Bound calls into C code with [PUSH|POP]_PTREGS_AUXILIARY
			instead of using a macro 'call'
	Split this patch out and put the PKS specific stuff in a
		separate patch

Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/Kconfig                 |  4 ++++
 arch/x86/entry/calling.h         | 20 ++++++++++++++++++++
 arch/x86/entry/entry_64.S        | 22 ++++++++++++++++++++++
 arch/x86/entry/entry_64_compat.S |  6 ++++++
 arch/x86/include/asm/ptrace.h    | 19 +++++++++++++++++++
 arch/x86/kernel/asm-offsets_64.c | 15 +++++++++++++++
 arch/x86/kernel/head_64.S        |  6 ++++++
 7 files changed, 92 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a30fe85e27ac..82342f27b218 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1877,6 +1877,10 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 
 	  If unsure, say y.
 
+config ARCH_HAS_PTREGS_AUXILIARY
+	depends on X86_64
+	bool
+
 choice
 	prompt "TSX enable mode"
 	depends on CPU_SUP_INTEL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..d0ebf9b069c9 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,6 +63,26 @@ For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+.macro PUSH_PTREGS_AUXILIARY
+	/* add space for pt_regs_auxiliary */
+	subq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+.macro POP_PTREGS_AUXILIARY
+	/* remove space for pt_regs_auxiliary */
+	addq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+#else
+
+#define PUSH_PTREGS_AUXILIARY
+#define POP_PTREGS_AUXILIARY
+
+#endif
+
 .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 466df3e50276..0684a8093965 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -332,7 +332,9 @@ SYM_CODE_END(ret_from_fork)
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	.endif
 
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	error_return
 .endm
@@ -435,7 +437,9 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	paranoid_exit
 
@@ -496,7 +500,9 @@ SYM_CODE_START(\asmsym)
 	 * stack.
 	 */
 	movq	%rsp, %rdi		/* pt_regs pointer */
+	PUSH_PTREGS_AUXILIARY
 	call	vc_switch_off_ist
+	POP_PTREGS_AUXILIARY
 	movq	%rax, %rsp		/* Switch to new stack */
 
 	UNWIND_HINT_REGS
@@ -507,7 +513,9 @@ SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
+	PUSH_PTREGS_AUXILIARY
 	call	kernel_\cfunc
+	POP_PTREGS_AUXILIARY
 
 	/*
 	 * No need to switch back to the IST stack. The current stack is either
@@ -542,7 +550,9 @@ SYM_CODE_START(\asmsym)
 	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
 	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
+	PUSH_PTREGS_AUXILIARY
 	call	\cfunc
+	POP_PTREGS_AUXILIARY
 
 	jmp	paranoid_exit
 
@@ -784,7 +794,9 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
+	PUSH_PTREGS_AUXILIARY
 	call	xen_pv_evtchn_do_upcall
+	POP_PTREGS_AUXILIARY
 
 	jmp	error_return
 SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -984,7 +996,9 @@ SYM_CODE_START_LOCAL(error_entry)
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	PUSH_PTREGS_AUXILIARY
 	call	sync_regs
+	POP_PTREGS_AUXILIARY
 	movq	%rax, %rsp			/* switch stack */
 	ENCODE_FRAME_POINTER
 	pushq	%r12
@@ -1040,7 +1054,9 @@ SYM_CODE_START_LOCAL(error_entry)
 	 * as if we faulted immediately after IRET.
 	 */
 	mov	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	fixup_bad_iret
+	POP_PTREGS_AUXILIARY
 	mov	%rax, %rsp
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
@@ -1146,7 +1162,9 @@ SYM_CODE_START(asm_exc_nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
+	PUSH_PTREGS_AUXILIARY
 	call	exc_nmi
+	POP_PTREGS_AUXILIARY
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1182,6 +1200,8 @@ SYM_CODE_START(asm_exc_nmi)
 	 * +---------------------------------------------------------+
 	 * | pt_regs                                                 |
 	 * +---------------------------------------------------------+
+	 * | (Optionally) pt_regs_auxiliary                          |
+	 * +---------------------------------------------------------+
 	 *
 	 * The "original" frame is used by hardware.  Before re-enabling
 	 * NMIs, we need to be done with it, and we need to leave enough
@@ -1358,7 +1378,9 @@ end_repeat_nmi:
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
+	PUSH_PTREGS_AUXILIARY
 	call	exc_nmi
+	POP_PTREGS_AUXILIARY
 
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..c6859d8acae4 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -136,7 +136,9 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 .Lsysenter_flags_fixed:
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_SYSENTER_32
+	POP_PTREGS_AUXILIARY
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -253,7 +255,9 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_fast_syscall_32
+	POP_PTREGS_AUXILIARY
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -410,6 +414,8 @@ SYM_CODE_START(entry_INT80_compat)
 	cld
 
 	movq	%rsp, %rdi
+	PUSH_PTREGS_AUXILIARY
 	call	do_int80_syscall_32
+	POP_PTREGS_AUXILIARY
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 703663175a5a..79541682e7f7 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_PTRACE_H
 #define _ASM_X86_PTRACE_H
 
+#include <linux/container_of.h>
 #include <asm/segment.h>
 #include <asm/page_types.h>
 #include <uapi/asm/ptrace.h>
 
 #ifndef __ASSEMBLY__
+
 #ifdef __i386__
 
 struct pt_regs {
@@ -91,6 +93,23 @@ struct pt_regs {
 /* top of stack page */
 };
 
+/*
+ * NOTE: Features which add data to pt_regs_auxiliary must select
+ * ARCH_HAS_PTREGS_AUXILIARY.  Failure to do so will result in a build failure.
+ */
+struct pt_regs_auxiliary {
+};
+
+struct pt_regs_extended {
+	struct pt_regs_auxiliary aux;
+	struct pt_regs pt_regs __aligned(8);
+};
+
+static inline struct pt_regs_extended *to_extended_pt_regs(struct pt_regs *regs)
+{
+	return container_of(regs, struct pt_regs_extended, pt_regs);
+}
+
 #endif /* !__i386__ */
 
 #ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..66f08ac3507a 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
 #endif
 
 #include <asm/ia32.h>
+#include <asm/ptrace.h>
 
 #if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 #include <asm/kvm_para.h>
@@ -60,5 +61,19 @@ int main(void)
 	DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
 	BLANK();
 #endif
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+	/* Size of Auxiliary pt_regs data */
+	DEFINE(PTREGS_AUX_SIZE, sizeof(struct pt_regs_extended) -
+				sizeof(struct pt_regs));
+#else
+	/*
+	 * Adding data to struct pt_regs_auxiliary requires setting
+	 * ARCH_HAS_PTREGS_AUXILIARY
+	 */
+	BUILD_BUG_ON((sizeof(struct pt_regs_extended) -
+		      sizeof(struct pt_regs)) != 0);
+#endif
+
 	return 0;
 }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..8418d9de8d70 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -336,8 +336,10 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
 	movq	initial_vc_handler(%rip), %rax
+	PUSH_PTREGS_AUXILIARY
 	ANNOTATE_RETPOLINE_SAFE
 	call	*%rax
+	POP_PTREGS_AUXILIARY
 
 	/* Unwind pt_regs */
 	POP_REGS
@@ -414,7 +416,9 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
 	UNWIND_HINT_REGS
 
 	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
+	PUSH_PTREGS_AUXILIARY
 	call do_early_exception
+	POP_PTREGS_AUXILIARY
 
 	decl early_recursion_flag(%rip)
 	jmp restore_regs_and_return_to_kernel
@@ -438,7 +442,9 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
 	/* Call C handler */
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
+	PUSH_PTREGS_AUXILIARY
 	call    do_vc_no_ghcb
+	POP_PTREGS_AUXILIARY
 
 	/* Unwind pt_regs */
 	POP_REGS
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 22/44] entry: Pass pt_regs to irqentry_exit_cond_resched()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (20 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 21/44] x86/entry: Add auxiliary pt_regs space ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 23/44] entry: Add architecture auxiliary pt_regs save/restore calls ira.weiny
                   ` (21 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Auxiliary pt_regs space needs to be manipulated by the generic
entry/exit code.

Unfortunately, the call to irqentry_exit_cond_resched() from
xen_pv_evtchn_do_upcall() bypasses the 'normal' irqentry_exit() call.

Normally, irqentry_exit() would take care of handling any auxiliary
pt_regs, but because of this bypass, irqentry_exit_cond_resched() must
handle it.

Add pt_regs to irqentry_exit_cond_resched() so that any auxiliary
pt_regs data can be handled.

Create an internal exit_cond_resched() call for irqentry_exit() to avoid
passing pt_regs because irqentry_exit() will directly handle any
auxiliary pt_regs data.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	New Patch
---
 arch/x86/entry/common.c      | 2 +-
 include/linux/entry-common.h | 3 ++-
 kernel/entry/common.c        | 9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..f1ba770d035d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
-		irqentry_exit_cond_resched();
+		irqentry_exit_cond_resched(regs);
 		instrumentation_end();
 		restore_inhcall(inhcall);
 	} else {
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index ddaffc983e62..14fd329847e7 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -451,10 +451,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
 
 /**
  * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ * @regs:	Pointer to pt_regs of interrupted context
  *
  * Conditional reschedule with additional sanity checks.
  */
-void irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched(struct pt_regs *regs);
 
 void __irqentry_exit_cond_resched(void);
 #ifdef CONFIG_PREEMPT_DYNAMIC
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 490442a48332..f4210a7fc84d 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -395,7 +395,7 @@ void __irqentry_exit_cond_resched(void)
 DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
 #endif
 
-void irqentry_exit_cond_resched(void)
+static void exit_cond_resched(void)
 {
 	if (IS_ENABLED(CONFIG_PREEMPTION)) {
 #ifdef CONFIG_PREEMPT_DYNAMIC
@@ -406,6 +406,11 @@ void irqentry_exit_cond_resched(void)
 	}
 }
 
+void irqentry_exit_cond_resched(struct pt_regs *regs)
+{
+	exit_cond_resched();
+}
+
 noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
@@ -431,7 +436,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 		}
 
 		instrumentation_begin();
-		irqentry_exit_cond_resched();
+		exit_cond_resched();
 		/* Covers both tracing and lockdep */
 		trace_hardirqs_on();
 		instrumentation_end();
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 23/44] entry: Add architecture auxiliary pt_regs save/restore calls
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (21 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 22/44] entry: Pass pt_regs to irqentry_exit_cond_resched() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 24/44] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
                   ` (20 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some architectures have auxiliary pt_regs space which is available to
store extra information on the stack.  For ease of implementation the
common C code was left to fill in the data when needed.

Define C calls for architectures to save and restore any auxiliary data
they may need and call those from the common entry code.

NOTE: Due to the split nature of the Xen exit code,
irqentry_exit_cond_resched() requires an unbalanced call to
arch_restore_aux_pt_regs() regardless of the preemption configuration.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	New patch which introduces a generic auxiliary pt_regs
		save/restore.
---
 include/linux/entry-common.h |  7 +++++++
 kernel/entry/common.c        | 16 ++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 14fd329847e7..b243f1cfd491 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -99,6 +99,13 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 }
 #endif
 
+#ifndef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs) { }
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs) { }
+
+#endif
+
 /**
  * enter_from_user_mode - Establish state when coming from user mode
  *
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f4210a7fc84d..c778e9783361 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -323,7 +323,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 
 	if (user_mode(regs)) {
 		irqentry_enter_from_user_mode(regs);
-		return ret;
+		goto aux_save;
 	}
 
 	/*
@@ -362,7 +362,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		instrumentation_end();
 
 		ret.exit_rcu = true;
-		return ret;
+		goto aux_save;
 	}
 
 	/*
@@ -377,6 +377,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 
+aux_save:
+	instrumentation_begin();
+	arch_save_aux_pt_regs(regs);
+	instrumentation_end();
+
 	return ret;
 }
 
@@ -408,6 +413,7 @@ static void exit_cond_resched(void)
 
 void irqentry_exit_cond_resched(struct pt_regs *regs)
 {
+	arch_restore_aux_pt_regs(regs);
 	exit_cond_resched();
 }
 
@@ -415,6 +421,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 {
 	lockdep_assert_irqs_disabled();
 
+	instrumentation_begin();
+	arch_restore_aux_pt_regs(regs);
+	instrumentation_end();
+
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
@@ -464,6 +474,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	instrumentation_begin();
 	trace_hardirqs_off_finish();
 	ftrace_nmi_enter();
+	arch_save_aux_pt_regs(regs);
 	instrumentation_end();
 
 	return irq_state;
@@ -472,6 +483,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
 	instrumentation_begin();
+	arch_restore_aux_pt_regs(regs);
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {
 		trace_hardirqs_on_prepare();
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 24/44] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (22 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 23/44] entry: Add architecture auxiliary pt_regs save/restore calls ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 25/44] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
                   ` (19 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The x86 architecture supports the new auxiliary pt_regs space if
ARCH_HAS_PTREGS_AUXILIARY is enabled.

Define the callbacks within the x86 code required by the core entry code
when this support is enabled.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	New patch
---
 arch/x86/include/asm/entry-common.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 43184640b579..5fa5dd2d539c 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -95,4 +95,16 @@ static __always_inline void arch_exit_to_user_mode(void)
 }
 #define arch_exit_to_user_mode arch_exit_to_user_mode
 
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
+{
+}
+
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
+{
+}
+
+#endif
+
 #endif
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 25/44] x86/pkeys: Preserve PKRS MSR across exceptions
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (23 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 24/44] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 26/44] x86/fault: Print PKS MSR on fault ira.weiny
                   ` (18 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

PKRS is a per-logical-processor MSR which overlays additional protection
for pages which have been mapped with a protection key.  It is desirable
to protect PKS pages while executing exception code, while still
allowing the exception code to alter the PKS permissions if necessary
for any access it may require.

To do this the current thread value must be saved, the CPU MSR value set
to the default value, and the saved value restored upon completion of
the exception.  This can be done with the new auxiliary pt_regs space.
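A rough user-space model of that save/default/restore sequence (the
function names and PKS_INIT_VALUE come from the series; the globals
standing in for the MSR and thread state, and the particular
all-access-disabled constant, are illustrative assumptions):

```c
#include <assert.h>

#define PKS_INIT_VALUE 0x55555555u	/* model: every key access-disabled */

static unsigned int thread_pkrs;	/* models current->thread.pks_saved_pkrs */
static unsigned int msr_pkrs;		/* models the per-CPU PKRS MSR */

struct pt_regs_auxiliary { unsigned int pks_thread_pkrs; };

/* On exception entry: stash the thread value, drop to the default */
static void pks_save_pt_regs(struct pt_regs_auxiliary *aux)
{
	aux->pks_thread_pkrs = thread_pkrs;
	msr_pkrs = PKS_INIT_VALUE;
}

/* On exception exit: restore the (possibly updated) thread value */
static void pks_restore_pt_regs(struct pt_regs_auxiliary *aux)
{
	thread_pkrs = aux->pks_thread_pkrs;
	msr_pkrs = thread_pkrs;
}
```

The point of the auxiliary space is that the stash lives on the stack of
the interrupted context, so nested exceptions each get their own copy.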

Turn on the new auxiliary pt_regs space by triggering
ARCH_HAS_PTREGS_AUXILIARY.  This is done by making
ARCH_HAS_PTREGS_AUXILIARY default yes and then dependent on
ARCH_ENABLE_SUPERVISOR_PKEYS.  Additional users of the auxiliary space
can OR in their Kconfig options as needed.
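For example, a later user of the auxiliary space would extend the
dependency list rather than select the symbol directly
(CONFIG_MY_NEW_FEATURE is a hypothetical placeholder):

```kconfig
config ARCH_HAS_PTREGS_AUXILIARY
	def_bool y
	depends on X86_64
	depends on ARCH_ENABLE_SUPERVISOR_PKEYS || MY_NEW_FEATURE
```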

Then define pks_{save|restore}_pt_regs() to use the auxiliary space to
store the thread PKRS value across exceptions.  Call pks_*_pt_regs()
from arch_{save|restore}_aux_pt_regs().

Update the PKS test code to properly clear the saved thread PKRS value
before returning to ensure current tests work with this change.
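The pkey_update_pkval() helper used in the test hunk below can be
modeled in user space as follows (the 2-bits-per-key layout matches the
PKRU/PKRS register format; treat this as a sketch, not the kernel
implementation):

```c
#include <assert.h>

#define PKR_BITS_PER_PKEY	2
#define PKEY_ACCESS_MASK	0x3u

/* Replace the two access bits for @pkey within @pkval */
static unsigned int pkey_update_pkval(unsigned int pkval, int pkey,
				      unsigned int accessbits)
{
	int shift = pkey * PKR_BITS_PER_PKEY;

	pkval &= ~(PKEY_ACCESS_MASK << shift);
	return pkval | (accessbits << shift);
}
```

Passing accessbits of 0, as the test callback does, grants full
read/write for that key without disturbing any other key's bits.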

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8:
	Tie this into the new generic auxiliary pt_regs support.
	Build this on the new irqentry_*() refactoring patches
	Split this patch off from the PKS portion of the auxiliary
		pt_regs functionality.
	From Thomas
		Fix noinstr mess
		s/write_pkrs/pks_write_pkrs
		s/pkrs_init_value/PKRS_INIT_VALUE
	Simplify the number and location of the save/restore calls.
		Cover entry from user space as well.

Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/Kconfig                    |  3 ++-
 arch/x86/include/asm/entry-common.h |  3 +++
 arch/x86/include/asm/pks.h          |  8 ++++++--
 arch/x86/include/asm/ptrace.h       |  3 +++
 arch/x86/mm/fault.c                 |  2 +-
 arch/x86/mm/pkeys.c                 | 32 +++++++++++++++++++++++++++++
 lib/pks/pks_test.c                  | 11 ++++++++--
 7 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 82342f27b218..62685906f7c3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1878,8 +1878,9 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 	  If unsure, say y.
 
 config ARCH_HAS_PTREGS_AUXILIARY
+	def_bool y
 	depends on X86_64
-	bool
+	depends on ARCH_ENABLE_SUPERVISOR_PKEYS
 
 choice
 	prompt "TSX enable mode"
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 5fa5dd2d539c..803727b95b3a 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -8,6 +8,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/fpu/api.h>
+#include <asm/pks.h>
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -99,10 +100,12 @@ static __always_inline void arch_exit_to_user_mode(void)
 
 static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
 {
+	pks_save_pt_regs(regs);
 }
 
 static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
 {
+	pks_restore_pt_regs(regs);
 }
 
 #endif
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index ee9fff5b4b13..82baa594cb3b 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -6,22 +6,26 @@
 
 void pks_setup(void);
 void pks_write_current(void);
+void pks_save_pt_regs(struct pt_regs *regs);
+void pks_restore_pt_regs(struct pt_regs *regs);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_setup(void) { }
 static inline void pks_write_current(void) { }
+static inline void pks_save_pt_regs(struct pt_regs *regs) { }
+static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 
 #ifdef CONFIG_PKS_TEST
 
-bool pks_test_callback(void);
+bool pks_test_callback(struct pt_regs *regs);
 
 #else /* !CONFIG_PKS_TEST */
 
-static inline bool pks_test_callback(void)
+static inline bool pks_test_callback(struct pt_regs *regs)
 {
 	return false;
 }
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 79541682e7f7..f2527d6451b3 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -98,6 +98,9 @@ struct pt_regs {
  * ARCH_HAS_PTREGS_AUXILIARY.  Failure to do so will result in a build failure.
  */
 struct pt_regs_auxiliary {
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	u32 pks_thread_pkrs;
+#endif
 };
 
 struct pt_regs_extended {
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bef879943260..030eb3e08550 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1164,7 +1164,7 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		 * is running.  If so, pks_test_callback() will clear the protection
 		 * mechanism and return true to indicate the fault was handled.
 		 */
-		if (pks_test_callback())
+		if (pks_test_callback(regs))
 			return;
 	}
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7c6498fb8f8d..33b7f84ed33b 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -256,6 +256,38 @@ void pks_write_current(void)
 	pks_write_pkrs(current->thread.pks_saved_pkrs);
 }
 
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * To protect against exceptions having potentially privileged access to memory
+ * of an interrupted thread, save the current thread value and set the PKRS
+ * value to be used during the exception.
+ */
+void pks_save_pt_regs(struct pt_regs *regs)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	aux_pt_regs->pks_thread_pkrs = current->thread.pks_saved_pkrs;
+	pks_write_pkrs(PKS_INIT_VALUE);
+}
+
+void pks_restore_pt_regs(struct pt_regs *regs)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	current->thread.pks_saved_pkrs = aux_pt_regs->pks_thread_pkrs;
+	pks_write_pkrs(current->thread.pks_saved_pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 933f1bed4820..77f872829300 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -43,6 +43,7 @@
 #include <uapi/asm-generic/mman-common.h>
 
 #include <asm/pks.h>
+#include <asm/ptrace.h>       /* for struct pt_regs */
 
 #include <asm/pks.h>
 
@@ -74,12 +75,18 @@ struct pks_test_ctx {
  * NOTE: The callback is responsible for clearing any condition which would
  * cause the fault to re-trigger.
  */
-bool pks_test_callback(void)
+bool pks_test_callback(struct pt_regs *regs)
 {
+	struct pt_regs_extended *ept_regs = to_extended_pt_regs(regs);
+	struct pt_regs_auxiliary *aux_pt_regs = &ept_regs->aux;
 	bool armed = (test_armed_key != 0);
+	u32 pkrs = aux_pt_regs->pks_thread_pkrs;
 
 	if (armed) {
-		pks_mk_readwrite(test_armed_key);
+		/* Enable read and write to stop faults */
+		aux_pt_regs->pks_thread_pkrs = pkey_update_pkval(pkrs,
+								 test_armed_key,
+								 0);
 		fault_cnt++;
 	}
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (24 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 25/44] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-01 18:13   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 27/44] mm/pkeys: Add PKS exception test ira.weiny
                   ` (17 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

If a PKS fault occurs, debugging will be easier if the PKRS MSR value
at the time of the fault is known.

Add pks_dump_fault_info() to dump the PKRS MSR on fault if enabled.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this into its own patch.
---
 arch/x86/include/asm/pks.h |  2 ++
 arch/x86/mm/fault.c        |  3 +++
 arch/x86/mm/pkeys.c        | 11 +++++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 82baa594cb3b..fc3c66f1bb04 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -8,6 +8,7 @@ void pks_setup(void);
 void pks_write_current(void);
 void pks_save_pt_regs(struct pt_regs *regs);
 void pks_restore_pt_regs(struct pt_regs *regs);
+void pks_dump_fault_info(struct pt_regs *regs);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
@@ -15,6 +16,7 @@ static inline void pks_setup(void) { }
 static inline void pks_write_current(void) { }
 static inline void pks_save_pt_regs(struct pt_regs *regs) { }
 static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
+static inline void pks_dump_fault_info(struct pt_regs *regs) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 030eb3e08550..697c06f08103 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -549,6 +549,9 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
 					       "permissions violation");
 
+	if (error_code & X86_PF_PK)
+		pks_dump_fault_info(regs);
+
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
 		struct desc_ptr idt, gdt;
 		u16 ldtr, tr;
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 33b7f84ed33b..bdd700d5ad03 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -288,6 +288,17 @@ void pks_restore_pt_regs(struct pt_regs *regs)
 	pks_write_pkrs(current->thread.pks_saved_pkrs);
 }
 
+void pks_dump_fault_info(struct pt_regs *regs)
+{
+	struct pt_regs_auxiliary *aux_pt_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+	pr_alert("PKRS: 0x%x\n", aux_pt_regs->pks_thread_pkrs);
+}
+
 /*
  * PKS is independent of PKU and either or both may be supported on a CPU.
  *
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 27/44] mm/pkeys: Add PKS exception test
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (25 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 26/44] x86/fault: Print PKS MSR on fault ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 28/44] mm/pkeys: Introduce pks_update_exception() ira.weiny
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

During an exception the interrupted thread's PKRS value must be
preserved, and the exception should get the default value for that
pkey.  Upon return from the exception the thread's PKRS value should be
restored.

Add a PKS test which forces a fault and checks the values saved as well
as tests the ability for code to change the Pkey value during the
exception.

Do this by changing the interrupted thread's pkey to read-only prior to
the exception.  The default test pkey value is no-access and therefore
should be seen during the exception.  The test then switches the pkey
to read/write during the exception.  Finally, ensure that the read-only
value is restored when the exception is completed.

	$ echo 4 > /sys/kernel/debug/x86/run_pks
	$ cat /sys/kernel/debug/x86/run_pks
	PASS
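The permission value expected at each stage can be checked with the
same 2-bit encoding (PKEY_DISABLE_ACCESS and PKEY_DISABLE_WRITE are the
standard pkey ABI bits; the helper here is an illustrative model):

```c
#include <assert.h>

#define PKEY_DISABLE_ACCESS	0x1u
#define PKEY_DISABLE_WRITE	0x2u

/* Extract the two access bits for @pkey from a PKRS-style value */
static unsigned int pkey_bits(unsigned int pkrs, int pkey)
{
	return (pkrs >> (pkey * 2)) & 0x3u;
}
```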

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Change for V8
	Split this test off from the testing patch and place it after
	the exception saving code.
---
 arch/x86/include/asm/pks.h |   3 +
 arch/x86/mm/pkeys.c        |   2 +-
 lib/pks/pks_test.c         | 145 +++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index fc3c66f1bb04..065386c8bf37 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -24,9 +24,12 @@ static inline void pks_dump_fault_info(struct pt_regs *regs) { }
 #ifdef CONFIG_PKS_TEST
 
 bool pks_test_callback(struct pt_regs *regs);
+#define __static_or_pks_test
 
 #else /* !CONFIG_PKS_TEST */
 
+#define __static_or_pks_test static
+
 static inline bool pks_test_callback(struct pt_regs *regs)
 {
 	return false;
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index bdd700d5ad03..1da78580d6de 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -210,7 +210,7 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
-static DEFINE_PER_CPU(u32, pkrs_cache);
+__static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
 
 /*
  * pks_write_pkrs() - Write the pkrs of the current CPU
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 77f872829300..008a1079579d 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -17,6 +17,8 @@
  * * 1  Allocate a single key and check all 3 permissions on a page.
  * * 2  'arm context' for context switch test
  * * 3  Check the context armed in '2' to ensure the MSR value was preserved
+ * * 4  Test that the exception thread PKRS remains independent of the
+ *      interrupted thread's PKRS
  * * 8  Loop through all CPUs, report the msr, and check against the default.
  * * 9  Set up and fault on a PKS protected page.
  *
@@ -53,8 +55,11 @@
 #define RUN_SINGLE		1
 #define ARM_CTX_SWITCH		2
 #define CHECK_CTX_SWITCH	3
+#define RUN_EXCEPTION		4
 #define RUN_CRASH_TEST		9
 
+DECLARE_PER_CPU(u32, pkrs_cache);
+
 static struct dentry *pks_test_dentry;
 static bool crash_armed;
 
@@ -65,8 +70,71 @@ static int prev_fault_cnt;
 
 struct pks_test_ctx {
 	int pkey;
+	bool pass;
 	char data[64];
 };
+static struct pks_test_ctx *test_exception_ctx;
+
+static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected)
+{
+	pk_reg = (pk_reg >> PKR_PKEY_SHIFT(pkey)) & PKEY_ACCESS_MASK;
+	return (pk_reg == expected);
+}
+
+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(int pkey, u32 expected)
+{
+	bool ret = true;
+	u64 pkrs;
+	u32 *tmp_cache;
+
+	tmp_cache = get_cpu_ptr(&pkrs_cache);
+	if (!check_pkey_val(*tmp_cache, pkey, expected))
+		ret = false;
+	put_cpu_ptr(tmp_cache);
+
+	rdmsrl(MSR_IA32_PKRS, pkrs);
+	if (!check_pkey_val(pkrs, pkey, expected))
+		ret = false;
+
+	return ret;
+}
+
+static void check_exception(u32 thread_pkrs)
+{
+	/* Check the thread saved state */
+	if (!check_pkey_val(thread_pkrs, test_armed_key, PKEY_DISABLE_WRITE)) {
+		pr_err("     FAIL: checking ept_regs->thread_pkrs\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/* Check that the exception state has disabled access */
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: PKRS cache and MSR\n");
+		test_exception_ctx->pass = false;
+	}
+
+	/*
+	 * Ensure an update can occur during exception without affecting the
+	 * interrupted thread.  The interrupted thread is checked after
+	 * exception...
+	 */
+	pks_mk_readwrite(test_armed_key);
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: exception did not change register to 0\n");
+		test_exception_ctx->pass = false;
+	}
+	pks_mk_noaccess(test_armed_key);
+	if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+		pr_err("     FAIL: exception did not change register to 0x%x\n",
+			PKEY_DISABLE_ACCESS);
+		test_exception_ctx->pass = false;
+	}
+}
 
 /*
  * pks_test_callback() is called by the fault handler to indicate it saw a PKey
@@ -82,6 +150,16 @@ bool pks_test_callback(struct pt_regs *regs)
 	bool armed = (test_armed_key != 0);
 	u32 pkrs = aux_pt_regs->pks_thread_pkrs;
 
+	if (test_exception_ctx) {
+		check_exception(pkrs);
+		/*
+		 * Stop this check directly within the exception because the
+		 * fault handler clean up code will call again while checking
+		 * the PMD entry and there is no need to check this again.
+		 */
+		test_exception_ctx = NULL;
+	}
+
 	if (armed) {
 		/* Enable read and write to stop faults */
 		aux_pt_regs->pks_thread_pkrs = pkey_update_pkval(pkrs,
@@ -240,6 +318,7 @@ static struct pks_test_ctx *alloc_ctx(u8 pkey)
 	}
 
 	ctx->pkey = pkey;
+	ctx->pass = true;
 	sprintf(ctx->data, "%s", "DEADBEEF");
 	return ctx;
 }
@@ -265,6 +344,69 @@ static bool run_single(void)
 	return rc;
 }
 
+static bool run_exception_test(void)
+{
+	void *ptr = NULL;
+	bool pass = true;
+	struct pks_test_ctx *ctx;
+
+	pr_info("     ***** BEGIN: exception checking\n");
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx)) {
+		pr_err("     FAIL: no context\n");
+		pass = false;
+		goto result;
+	}
+	ctx->pass = true;
+
+	ptr = alloc_test_page(ctx->pkey);
+	if (!ptr) {
+		pr_err("     FAIL: no vmalloc page\n");
+		pass = false;
+		goto free_context;
+	}
+
+	pks_update_protection(ctx->pkey, PKEY_DISABLE_WRITE);
+
+	WRITE_ONCE(test_exception_ctx, ctx);
+	WRITE_ONCE(test_armed_key, ctx->pkey);
+
+	memcpy(ptr, ctx->data, 8);
+
+	if (!fault_caught()) {
+		pr_err("     FAIL: did not get an exception\n");
+		pass = false;
+	}
+
+	/*
+	 * NOTE The exception code has to enable access (b00) to keep the fault
+	 * from looping forever.  Therefore full access is seen here rather
+	 * than write disabled.
+	 *
+	 * Furthermore, check_exception() disabled access during the exception
+	 * so this is testing that the thread value was restored back to the
+	 * thread value.
+	 */
+	if (!check_pkrs(test_armed_key, 0)) {
+		pr_err("     FAIL: PKRS not restored\n");
+		pass = false;
+	}
+
+	if (!ctx->pass)
+		pass = false;
+
+	WRITE_ONCE(test_armed_key, 0);
+
+	vfree(ptr);
+free_context:
+	free_ctx(ctx);
+result:
+	pr_info("     ***** END: exception checking : %s\n",
+		 pass ? "PASS" : "FAIL");
+	return pass;
+}
+
 static void crash_it(void)
 {
 	struct pks_test_ctx *ctx;
@@ -427,6 +569,9 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 		/* After context switch MSR should be restored */
 		check_ctx_switch(file);
 		break;
+	case RUN_EXCEPTION:
+		last_test_pass = run_exception_test();
+		break;
 	default:
 		last_test_pass = false;
 		break;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 28/44] mm/pkeys: Introduce pks_update_exception()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (26 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 27/44] mm/pkeys: Add PKS exception test ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 29/44] mm/pkeys: Introduce PKS fault callbacks ira.weiny
                   ` (15 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some PKS use cases will want to catch permissions violations and
optionally allow them.

pks_update_protection() updates the protection of the current running
context.  It will _not_ work to change the protections of a thread which
has been interrupted.  Therefore updating a thread from within an
exception is not possible with pks_update_protection().

Introduce pks_update_exception() to update the faulted thread's
protections in addition to the current context.  A PKS fault callback
can then use it to adjust the permissions of the faulted thread as
necessary.

Add documentation.
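Semantically, pks_update_exception() differs from
pks_update_protection() in that it also rewrites the value saved in the
auxiliary pt_regs, so the change survives the restore on exception
exit.  A user-space model of that distinction (illustrative only; the
real function takes a struct pt_regs pointer, not the auxiliary area
directly):

```c
#include <assert.h>

static unsigned int pkey_update_pkval(unsigned int pkval, int pkey,
				      unsigned int accessbits)
{
	pkval &= ~(0x3u << (pkey * 2));
	return pkval | (accessbits << (pkey * 2));
}

struct pt_regs_auxiliary { unsigned int pks_thread_pkrs; };
static unsigned int thread_pkrs;	/* models current->thread.pks_saved_pkrs */

/* Update both the interrupted thread's saved value and the current one */
static void pks_update_exception(struct pt_regs_auxiliary *aux, int pkey,
				 unsigned int protection)
{
	aux->pks_thread_pkrs = pkey_update_pkval(aux->pks_thread_pkrs,
						 pkey, protection);
	thread_pkrs = pkey_update_pkval(thread_pkrs, pkey, protection);
}
```

Updating only the live value, as pks_update_protection() does, would be
silently undone when the saved value is written back on exception exit.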

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Remove the concept of abandoning a pkey in favor of using the
		custom fault handler via this new pks_update_exception()
		call
	Without an abandon call there is no need for an abandon mask on
		sched in, new thread creation, or within exceptions...
	This now lets all invalid accesses fault
	Ensure that all entry points into PKS have feature checks...
	Place abandon fault check before the test callback to ensure
		testing does not detect the double fault of the abandon
		code and flag it incorrectly as a fault.
	Change return type of pks_handle_abandoned_pkeys() to bool
---
 Documentation/core-api/protection-keys.rst |  3 ++
 arch/x86/mm/pkeys.c                        | 49 +++++++++++++++++++---
 include/linux/pkeys.h                      |  5 +++
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 115afc67153f..b89308bf117e 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -147,6 +147,9 @@ Changing permissions of individual keys
 .. kernel-doc:: include/linux/pks-keys.h
         :identifiers: pks_mk_readwrite pks_mk_noaccess
 
+.. kernel-doc:: arch/x86/mm/pkeys.c
+        :identifiers: pks_update_exception
+
 MSR details
 -----------
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 1da78580d6de..6723ae42732a 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -319,6 +319,15 @@ void pks_setup(void)
 	cr4_set_bits(X86_CR4_PKS);
 }
 
+static void __pks_update_protection(int pkey, u32 protection)
+{
+	u32 pkrs = current->thread.pks_saved_pkrs;
+
+	current->thread.pks_saved_pkrs = pkey_update_pkval(pkrs, pkey,
+							   protection);
+	pks_write_pkrs(current->thread.pks_saved_pkrs);
+}
+
 /*
  * Do not call this directly, see pks_mk*().
  *
@@ -332,18 +341,46 @@ void pks_setup(void)
  */
 void pks_update_protection(int pkey, u32 protection)
 {
-	u32 pkrs;
-
 	if (!cpu_feature_enabled(X86_FEATURE_PKS))
 		return;
 
-	pkrs = current->thread.pks_saved_pkrs;
-	current->thread.pks_saved_pkrs = pkey_update_pkval(pkrs, pkey,
-							   protection);
 	preempt_disable();
-	pks_write_pkrs(current->thread.pks_saved_pkrs);
+	__pks_update_protection(pkey, protection);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(pks_update_protection);
 
+/**
+ * pks_update_exception() - Update the protections of a faulted thread
+ *
+ * @regs: Faulting thread registers
+ * @pkey: pkey to update
+ * @protection: protection bits to use.
+ *
+ * CONTEXT: Exception
+ *
+ * pks_update_protection() updates the protection of the current running
+ * context.  It will not work to change the protections of a thread which has
+ * been interrupted.  If a PKS fault callback fires it may want to update the
+ * faulted thread's protections in addition to its own.
+ *
+ * Use pks_update_exception() to update the faulted thread's protections
+ * in addition to the current context.
+ */
+void pks_update_exception(struct pt_regs *regs, int pkey, u32 protection)
+{
+	struct pt_regs_extended *ept_regs;
+	u32 old;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	__pks_update_protection(pkey, protection);
+
+	ept_regs = to_extended_pt_regs(regs);
+	old = ept_regs->aux.pks_thread_pkrs;
+	ept_regs->aux.pks_thread_pkrs = pkey_update_pkval(old, pkey, protection);
+}
+EXPORT_SYMBOL_GPL(pks_update_exception);
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 5f4965f5449b..c318d97f5da8 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -56,6 +56,7 @@ static inline bool arch_pkeys_enabled(void)
 #include <uapi/asm-generic/mman-common.h>
 
 void pks_update_protection(int pkey, u32 protection);
+void pks_update_exception(struct pt_regs *regs, int pkey, u32 protection);
 
 /**
  * pks_mk_noaccess() - Disable all access to the domain
@@ -85,6 +86,10 @@ static inline void pks_mk_readwrite(int pkey)
 
 static inline void pks_mk_noaccess(int pkey) {}
 static inline void pks_mk_readwrite(int pkey) {}
+static inline void pks_update_exception(struct pt_regs *regs,
+					int pkey,
+					u32 protection)
+{ }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 29/44] mm/pkeys: Introduce PKS fault callbacks
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (27 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 28/44] mm/pkeys: Introduce pks_update_exception() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback ira.weiny
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Some PKS keys will want special handling on accesses that violate the
pkey permissions.  One of these is PMEM, which will want a mode that
logs the access violation, disables protection, and continues rather
than oops'ing the machine.

Provide an API to set callbacks for individual Pkeys.  Call these
through pks_handle_key_fault() which is called in the fault handler.

Since PKS faults do not provide the key that faulted, this information
needs to be recovered by walking the page tables and extracting it from
the leaf entry.  The key can then be used to call the specific user
defined callback.
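Since the hardware error code does not carry the key, the walk ends by masking the key bits out of the leaf entry's flags.  A minimal sketch of that extraction (the bit positions follow the x86 page-table entry layout, bits 62:59; the `MODEL_` macro names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/*
 * On x86 the protection key lives in bits 62:59 of a page-table entry.
 * Illustrative stand-ins for the kernel's _PAGE_PKEY_MASK and
 * pte_flags_pkey() definitions.
 */
#define MODEL_PKEY_BIT0	59
#define MODEL_PKEY_MASK	(0xfULL << MODEL_PKEY_BIT0)

/* Recover the 4-bit pkey from the flags of a leaf page-table entry. */
uint16_t model_pte_flags_pkey(uint64_t pte_flags)
{
	return (uint16_t)((pte_flags & MODEL_PKEY_MASK) >> MODEL_PKEY_BIT0);
}
```

The same mask-and-shift applies at whichever level the walk finds a leaf (large p4d/pud/pmd or a pte), which is why the walk checks for large entries before checking present.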

This infrastructure could be used to implement the PKS testing code.
Unfortunately, that would limit the ability to test this code itself as
well as limit the testing code to a single pkey.  Because
pks_test_callback() has zero overhead when CONFIG_PKS_TEST is not
specified, it is left as a separate hook in the fault handler.

Add documentation.

Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
Changes for V8:
	Add pt_regs to the callback signature so that
		pks_update_exception() can be called if needed.
	Update commit message
	Determine if page is large prior to not present
	Update commit message with more clarity as to why this was kept
		separate from pks_abandon_protections() and
		pks_test_callback()
	Embed documentation in c file.
	Move handle_pks_key_fault() to pkeys.c
		s/handle_pks_key_fault/pks_handle_key_fault/
		This consolidates the PKS code nicely
	Add feature check to pks_handle_key_fault()
	From Rick Edgecombe
		Fix key value check
	From kernel test robot
		Add static to handle_pks_key_fault

Changes for V7:
	New patch
---
 Documentation/core-api/protection-keys.rst |  9 ++-
 arch/x86/include/asm/pks.h                 |  9 +++
 arch/x86/mm/fault.c                        |  3 +
 arch/x86/mm/pkeys.c                        | 86 ++++++++++++++++++++++
 include/linux/pkeys.h                      |  3 +
 include/linux/pks-keys.h                   |  2 +
 6 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index b89308bf117e..267efa2112e7 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -115,7 +115,8 @@ Overview
 
 Similar to user space pkeys, supervisor pkeys allow additional protections to
 be defined for a supervisor mappings.  Unlike user space pkeys, violations of
-these protections result in a kernel oops.
+these protections result in a kernel oops unless a PKS fault handler is
+provided which handles the fault.
 
 Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
 Sapphire Rapids (and later) "Scalable Processor" Server CPUs.  It will also be
@@ -150,6 +151,12 @@ Changing permissions of individual keys
 .. kernel-doc:: arch/x86/mm/pkeys.c
         :identifiers: pks_update_exception
 
+Overriding Default Fault Behavior
+---------------------------------
+
+.. kernel-doc:: arch/x86/mm/pkeys.c
+        :doc: DEFINE_PKS_FAULT_CALLBACK
+
 MSR details
 -----------
 
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 065386c8bf37..55541bb64d08 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -9,6 +9,8 @@ void pks_write_current(void);
 void pks_save_pt_regs(struct pt_regs *regs);
 void pks_restore_pt_regs(struct pt_regs *regs);
 void pks_dump_fault_info(struct pt_regs *regs);
+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+			  unsigned long address);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
@@ -18,6 +20,13 @@ static inline void pks_save_pt_regs(struct pt_regs *regs) { }
 static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
 static inline void pks_dump_fault_info(struct pt_regs *regs) { }
 
+static inline bool pks_handle_key_fault(struct pt_regs *regs,
+					unsigned long hw_error_code,
+					unsigned long address)
+{
+	return false;
+}
+
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 697c06f08103..e378573d97a7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1162,6 +1162,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 		 */
 		WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
 
+		if (pks_handle_key_fault(regs, hw_error_code, address))
+			return;
+
 		/*
 		 * If a protection key exception occurs it could be because a PKS test
 		 * is running.  If so, pks_test_callback() will clear the protection
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6723ae42732a..531cf6c74ad7 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -11,6 +11,7 @@
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
 #include <asm/mmu_context.h>            /* vma_pkey()                   */
 #include <asm/pks.h>
+#include <asm/trap_pf.h>		/* X86_PF_WRITE */
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
@@ -212,6 +213,91 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
 
 __static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
 
+/**
+ * DOC: DEFINE_PKS_FAULT_CALLBACK
+ *
+ * Users may also provide a fault handler which can handle a fault differently
+ * than an oops.  For example, if 'MY_FEATURE' wanted to define a handler, it
+ * could do so by adding the corresponding entry to the pks_key_callbacks array.
+ *
+ * .. code-block:: c
+ *
+ *	#ifdef CONFIG_MY_FEATURE
+ *	bool my_feature_pks_fault_callback(struct pt_regs *regs,
+ *					   unsigned long address, bool write)
+ *	{
+ *		if (my_feature_fault_is_ok)
+ *			return true;
+ *		return false;
+ *	}
+ *	#endif
+ *
+ *	static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
+ *		[PKS_KEY_DEFAULT]            = NULL,
+ *	#ifdef CONFIG_MY_FEATURE
+ *		[PKS_KEY_PGMAP_PROTECTION]   = my_feature_pks_fault_callback,
+ *	#endif
+ *	};
+ */
+static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
+
+static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
+				    bool write, u16 key)
+{
+	if (key >= PKS_KEY_NR_CONSUMERS)
+		return false;
+
+	if (pks_key_callbacks[key])
+		return pks_key_callbacks[key](regs, address, write);
+
+	return false;
+}
+
+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+			  unsigned long address)
+{
+	bool write;
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+	pte_t pte;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return false;
+
+	write = (hw_error_code & X86_PF_WRITE);
+
+	pgd = READ_ONCE(*(init_mm.pgd + pgd_index(address)));
+	if (!pgd_present(pgd))
+		return false;
+
+	p4d = READ_ONCE(*p4d_offset(&pgd, address));
+	if (p4d_large(p4d))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(p4d_val(p4d)));
+	if (!p4d_present(p4d))
+		return false;
+
+	pud = READ_ONCE(*pud_offset(&p4d, address));
+	if (pud_large(pud))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(pud_val(pud)));
+	if (!pud_present(pud))
+		return false;
+
+	pmd = READ_ONCE(*pmd_offset(&pud, address));
+	if (pmd_large(pmd))
+		return pks_call_fault_callback(regs, address, write,
+					       pte_flags_pkey(pmd_val(pmd)));
+	if (!pmd_present(pmd))
+		return false;
+
+	pte = READ_ONCE(*pte_offset_kernel(&pmd, address));
+	return pks_call_fault_callback(regs, address, write,
+				       pte_flags_pkey(pte_val(pte)));
+}
+
 /*
  * pks_write_pkrs() - Write the pkrs of the current CPU
  * @new_pkrs: New value to write to the current CPU register
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index c318d97f5da8..a53e4f2c41af 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -82,6 +82,9 @@ static inline void pks_mk_readwrite(int pkey)
 	pks_update_protection(pkey, PKEY_READ_WRITE);
 }
 
+typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
+				 bool write);
+
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void pks_mk_noaccess(int pkey) {}
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index 69a0be979515..a3fcd8df8688 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -27,6 +27,7 @@
  *	{
  *		PKS_KEY_DEFAULT         = 0,
  *		PKS_KEY_MY_FEATURE      = 1,
+ *		PKS_KEY_NR_CONSUMERS    = 2,
  *	}
  *
  *	#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		|
@@ -43,6 +44,7 @@
 enum pks_pkey_consumers {
 	PKS_KEY_DEFAULT		= 0, /* Must be 0 for default PTE values */
 	PKS_KEY_TEST		= 1,
+	PKS_KEY_NR_CONSUMERS	= 2,
 };
 
 #define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (28 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 29/44] mm/pkeys: Introduce PKS fault callbacks ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-01  0:55   ` Edgecombe, Rick P
  2022-02-01 17:42   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 31/44] mm/pkeys: Add pks_available() ira.weiny
                   ` (13 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

A common use case for the custom fault callbacks will be for the
callback to warn of the violation and relax the permissions rather than
crash the kernel.

For example, a non-security use case may want to relax the permissions
and flag the invalid access rather than strictly crash the kernel.  In
this case the user defines a callback which detects the condition,
reports the error, and allows continued operation by handling the fault
through pks_update_exception().
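A minimal model of such a warn-and-relax callback, with the thread's PKRS value and the violation log reduced to plain globals (illustrative stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the thread's PKRS value and a violation log. */
uint32_t model_pkrs;
int model_violations_logged;

/*
 * Illustrative warn-and-relax callback: report the stray access, grant
 * RD/WR on the key (both AD/WD bits cleared), and return true so the
 * fault handler continues instead of oops'ing.
 */
bool model_relax_callback(int pkey)
{
	model_violations_logged++;		/* report the error ... */
	model_pkrs &= ~(3u << (pkey * 2));	/* ... then open the key */
	return true;				/* fault handled */
}
```

After the callback opens the key, a retry of the faulting access succeeds and no further callbacks fire for that key.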

Add a test which does this.

	$ echo 5 > /sys/kernel/debug/x86/run_pks
	$ cat /sys/kernel/debug/x86/run_pks
	PASS

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	New test developed just to double check for regressions while
	reworking the code.
---
 arch/x86/include/asm/pks.h |  2 ++
 arch/x86/mm/pkeys.c        |  6 +++-
 lib/pks/pks_test.c         | 74 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 55541bb64d08..e09934c540e2 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -34,6 +34,8 @@ static inline bool pks_handle_key_fault(struct pt_regs *regs,
 
 bool pks_test_callback(struct pt_regs *regs);
 #define __static_or_pks_test
+bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
+			     bool write);
 
 #else /* !CONFIG_PKS_TEST */
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 531cf6c74ad7..f30ac8215785 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -239,7 +239,11 @@ __static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);
  *	#endif
  *	};
  */
-static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = { 0 };
+static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
+#ifdef CONFIG_PKS_TEST
+	[PKS_KEY_TEST]		= pks_test_fault_callback,
+#endif
+};
 
 static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
 				    bool write, u16 key)
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 008a1079579d..1528df0bb283 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -19,6 +19,7 @@
  * * 3  Check the context armed in '2' to ensure the MSR value was preserved
  * * 4  Test that the exception thread PKRS remains independent of the
  *      interrupted threads PKRS
+ * * 5  Test setting a key to RD/WR in a fault callback to abandon a key
  * * 8  Loop through all CPUs, report the msr, and check against the default.
  * * 9  Set up and fault on a PKS protected page.
  *
@@ -56,6 +57,7 @@
 #define ARM_CTX_SWITCH		2
 #define CHECK_CTX_SWITCH	3
 #define RUN_EXCEPTION		4
+#define RUN_FAULT_ABANDON	5
 #define RUN_CRASH_TEST		9
 
 DECLARE_PER_CPU(u32, pkrs_cache);
@@ -519,6 +521,75 @@ static void check_ctx_switch(struct file *file)
 	}
 }
 
+struct {
+	struct pks_test_ctx *ctx;
+	void *test_page;
+	bool armed;
+	bool callback_seen;
+} fault_callback_ctx;
+
+bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
+			     bool write)
+{
+	if (!fault_callback_ctx.armed)
+		return false;
+
+	fault_callback_ctx.armed = false;
+	fault_callback_ctx.callback_seen = true;
+
+	pks_update_exception(regs, fault_callback_ctx.ctx->pkey, 0);
+
+	return true;
+}
+
+static bool run_fault_clear_test(void)
+{
+	struct pks_test_ctx *ctx;
+	void *test_page;
+	bool rc = true;
+
+	ctx = alloc_ctx(PKS_KEY_TEST);
+	if (IS_ERR(ctx))
+		return false;
+
+	test_page = alloc_test_page(ctx->pkey);
+	if (!test_page) {
+		pr_err("Failed to vmalloc page???\n");
+		free_ctx(ctx);
+		return false;
+	}
+
+	test_armed_key = PKS_KEY_TEST;
+	fault_callback_ctx.ctx = ctx;
+	fault_callback_ctx.test_page = test_page;
+	fault_callback_ctx.armed = true;
+	fault_callback_ctx.callback_seen = false;
+
+	pks_mk_noaccess(test_armed_key);
+
+	/* fault */
+	memcpy(test_page, ctx->data, 8);
+
+	if (!fault_callback_ctx.callback_seen) {
+		pr_err("Failed to see the callback\n");
+		rc = false;
+		goto done;
+	}
+
+	/* no fault */
+	fault_callback_ctx.callback_seen = false;
+	memcpy(test_page, ctx->data, 8);
+
+	if (fault_caught() || fault_callback_ctx.callback_seen) {
+		pr_err("The key failed to be set RD/WR in the callback\n");
+		rc = false;
+	}
+
+done:
+	free_ctx(ctx);
+	return rc;
+}
+
 static ssize_t pks_read_file(struct file *file, char __user *user_buf,
 			     size_t count, loff_t *ppos)
 {
@@ -572,6 +643,9 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
 	case RUN_EXCEPTION:
 		last_test_pass = run_exception_test();
 		break;
+	case RUN_FAULT_ABANDON:
+		last_test_pass = run_fault_clear_test();
+		break;
 	default:
 		last_test_pass = false;
 		break;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 31/44] mm/pkeys: Add pks_available()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (29 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The PKS API calls will not fail if they are called on a CPU which does
not support the PKS feature.  There will be no protection but the API is
safe to call.  However, adding the overhead of these calls on CPUs which
don't support PKS is inefficient.

Define pks_available() to allow users to check if PKS is enabled on the
current system.  If not, they can choose to optimize around the PKS calls.
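The caller pattern this enables, sketched with userspace stand-ins (`model_` names are illustrative; the real check is `cpu_feature_enabled(X86_FEATURE_PKS)`):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for cpu_feature_enabled(X86_FEATURE_PKS). */
bool model_feature_pks;

bool model_pks_available(void)
{
	return model_feature_pks;
}

/* Caller pattern: skip the harmless-but-useless PKS calls entirely. */
int model_pks_calls_made;

void model_driver_setup(void)
{
	if (!model_pks_available())
		return;			/* optimize the PKS calls away */
	model_pks_calls_made++;		/* e.g. a pks_mk_noaccess() call */
}
```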

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	s/pks_enabled/pks_available
---
 Documentation/core-api/protection-keys.rst |  3 +++
 arch/x86/mm/pkeys.c                        | 10 ++++++++++
 include/linux/pkeys.h                      |  6 ++++++
 3 files changed, 19 insertions(+)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 267efa2112e7..27c9701d4aeb 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -151,6 +151,9 @@ Changing permissions of individual keys
 .. kernel-doc:: arch/x86/mm/pkeys.c
         :identifiers: pks_update_exception
 
+.. kernel-doc:: arch/x86/mm/pkeys.c
+        :identifiers: pks_available
+
 Overriding Default Fault Behavior
 ---------------------------------
 
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f30ac8215785..fa71037c1dd0 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -418,6 +418,16 @@ static void __pks_update_protection(int pkey, u32 protection)
 	pks_write_pkrs(current->thread.pks_saved_pkrs);
 }
 
+/**
+ * pks_available() - Is PKS available on this system
+ *
+ * Return if PKS is currently supported and enabled on this system.
+ */
+bool pks_available(void)
+{
+	return cpu_feature_enabled(X86_FEATURE_PKS);
+}
+
 /*
  * Do not call this directly, see pks_mk*().
  *
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index a53e4f2c41af..ec5463c373a1 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -55,6 +55,7 @@ static inline bool arch_pkeys_enabled(void)
 
 #include <uapi/asm-generic/mman-common.h>
 
+bool pks_available(void);
 void pks_update_protection(int pkey, u32 protection);
 void pks_update_exception(struct pt_regs *regs, int pkey, u32 protection);
 
@@ -87,6 +88,11 @@ typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
+static inline bool pks_available(void)
+{
+	return false;
+}
+
 static inline void pks_mk_noaccess(int pkey) {}
 static inline void pks_mk_readwrite(int pkey) {}
 static inline void pks_update_exception(struct pt_regs *regs,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (30 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 31/44] mm/pkeys: Add pks_available() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-04 15:49   ` Dan Williams
  2022-01-27 17:54 ` [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available() ira.weiny
                   ` (11 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM additionally are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.

Not all systems with PMEM will want additional protections.  Therefore,
add a Kconfig option for the user to configure the additional devmap
protections.

Only systems with supervisor protection keys (PKS) are able to support
this new protection so depend on ARCH_HAS_SUPERVISOR_PKEYS.
Furthermore, select ARCH_ENABLE_SUPERVISOR_PKEYS to ensure that the
architecture support is enabled if PMEM is the only use case.

Only PMEM which is advertised to the memory subsystem needs this
protection.  Therefore, the feature depends on NVDIMM_PFN.

A default of (NVDIMM_PFN && ARCH_HAS_SUPERVISOR_PKEYS) was suggested but
logically that is the same as saying default 'yes' because both
NVDIMM_PFN and ARCH_HAS_SUPERVISOR_PKEYS are required.  Therefore a
default of 'yes' is used.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this out from
		[PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS)
---
 mm/Kconfig | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 46f2bb15aa4e..67e0264acf7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -776,6 +776,24 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVMAP_ACCESS_PROTECTION
+	bool "Access protection for memremap_pages()"
+	depends on NVDIMM_PFN
+	depends on ARCH_HAS_SUPERVISOR_PKEYS
+	select ARCH_ENABLE_SUPERVISOR_PKEYS
+	default y
+
+	help
+	  Enable extra protections on device memory.  This protects against
+	  unintended access to devices such as stray writes.  This feature is
+	  particularly useful to protect against corruption of persistent
+	  memory.
+
+	  This depends on architecture support of supervisor PKeys and has no
+	  overhead if the architecture does not support them.
+
+	  If you have persistent memory say 'Y'.
+
 config DEV_PAGEMAP_OPS
 	bool
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (31 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-04 16:19   ` Dan Williams
  2022-01-27 17:54 ` [PATCH V8 34/44] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
                   ` (10 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Users will need to specify that they want their dev_pagemap pages
protected by specifying a flag in (struct dev_pagemap)->flags.  However,
it is more efficient to know whether that protection is available prior
to requesting it and having the mapping fail.

Define pgmap_protection_available() for users to check if protection is
available to be used.  The name of pgmap_protection_available() was
specifically chosen to isolate the implementation of the protection from
higher level users.  However, the current implementation simply calls
pks_available() to determine if it can support protection.

An alternative was to have users specify the flag and check whether the
returned dev_pagemap object was protected or not.  But this was
considered less efficient than a direct check beforehand.
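The probe-first pattern can be sketched as follows; the `model_` names and the `MODEL_PGMAP_PROTECTION` bit are illustrative stand-ins for `pgmap_protection_available()` and the flag requested via `(struct dev_pagemap)->flags`:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PGMAP_PROTECTION (1UL << 1)	/* requested via pgmap->flags */

bool model_pks_available;	/* stand-in for pks_available() */

/* Isolates callers from the PKS implementation detail. */
bool model_pgmap_protection_available(void)
{
	return model_pks_available;
}

/* Caller pattern: probe first rather than request-and-fail. */
unsigned long model_choose_pgmap_flags(void)
{
	if (model_pgmap_protection_available())
		return MODEL_PGMAP_PROTECTION;
	return 0;
}
```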

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this out into its own patch.
	s/pgmap_protection_enabled/pgmap_protection_available
---
 include/linux/mm.h | 13 +++++++++++++
 mm/memremap.c      | 11 +++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..2ae99bee6e82 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1143,6 +1143,19 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+
+bool pgmap_protection_available(void);
+
+#else
+
+static inline bool pgmap_protection_available(void)
+{
+	return false;
+}
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define folio_ref_zero_or_close_to_overflow(folio) \
 	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..c13b3b8a0048 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -6,6 +6,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm.h>
 #include <linux/pfn_t.h>
+#include <linux/pkeys.h>
 #include <linux/swap.h>
 #include <linux/mmzone.h>
 #include <linux/swapops.h>
@@ -63,6 +64,16 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+
+bool pgmap_protection_available(void)
+{
+	return pks_available();
+}
+EXPORT_SYMBOL_GPL(pgmap_protection_available);
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 34/44] memremap_pages: Introduce a PGMAP_PROTECTION flag
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (32 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 35/44] memremap_pages: Introduce devmap_protected() ira.weiny
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM additionally are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.

Some systems which have enabled DEVMAP_ACCESS_PROTECTION may not have
PMEM installed, or the PMEM may not be mapped into the direct map.

Also, memremap_pages() users other than PMEM will not want these pages
protected.

Define a new PGMAP flag, PGMAP_PROTECTION.  This can be passed in
(struct dev_pagemap)->flags when calling memremap_pages() to request
that the pages be protected.  Then use the flag to enable a static key.
The static key is used to optimize the protection away if no callers are
currently using protections.

Specifying this flag on a system which can't support protections will
fail.  Users are expected to check if protections are supported via
pgmap_protection_available() prior to asking for them.
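The flag handling and the static-key refcounting can be modeled as below.  The real code uses a static branch (patched branch sites with zero cost when unused); the `model_` names reduce it to a plain enable count and are illustrative only:

```c
#include <assert.h>

#define MODEL_PGMAP_PROTECTION (1UL << 1)

/*
 * The real code uses a static key so an unused protection costs nothing
 * at the check sites; model it here as a simple enable count.
 */
int model_protection_count;

void model_devmap_protection_enable(void)  { model_protection_count++; }
void model_devmap_protection_disable(void) { model_protection_count--; }

/* Sketch of the memremap_pages() flag handling. */
int model_memremap_pages(unsigned long flags, int protection_available)
{
	if (flags & MODEL_PGMAP_PROTECTION) {
		if (!protection_available)
			return -22;	/* -EINVAL: caller should have probed */
		model_devmap_protection_enable();
	}
	return 0;
}
```

memunmap_pages() performs the matching disable, so the key drops back to false once the last protected mapping goes away.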

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this out into its own patch
---
 include/linux/memremap.h |  1 +
 mm/memremap.c            | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..84402f73712c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -80,6 +80,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_PROTECTION	(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/mm/memremap.c b/mm/memremap.c
index c13b3b8a0048..a74d985a1908 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -66,12 +66,39 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
 
 #ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
 
+/*
+ * Note: all devices which have asked for protections share the same key.  The
+ * key may, or may not, have been provided by the core.  If not, protection
+ * will be disabled.  The key acquisition is attempted when the first ZONE
+ * DEVICE requests it and freed when all zones have been unmapped.
+ *
+ * Also this must be EXPORT_SYMBOL rather than EXPORT_SYMBOL_GPL because it is
+ * intended to be used in the kmap API.
+ */
+DEFINE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+EXPORT_SYMBOL(dev_pgmap_protection_static_key);
+
+static void devmap_protection_enable(void)
+{
+	static_branch_inc(&dev_pgmap_protection_static_key);
+}
+
+static void devmap_protection_disable(void)
+{
+	static_branch_dec(&dev_pgmap_protection_static_key);
+}
+
 bool pgmap_protection_available(void)
 {
 	return pks_available();
 }
 EXPORT_SYMBOL_GPL(pgmap_protection_available);
 
+#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */
+
+static void devmap_protection_enable(void) { }
+static void devmap_protection_disable(void) { }
+
 #endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
 
 static void pgmap_array_delete(struct range *range)
@@ -173,6 +200,9 @@ void memunmap_pages(struct dev_pagemap *pgmap)
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put(pgmap);
+
+	if (pgmap->flags & PGMAP_PROTECTION)
+		devmap_protection_disable();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -319,6 +349,12 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
 		return ERR_PTR(-EINVAL);
 
+	if (pgmap->flags & PGMAP_PROTECTION) {
+		if (!pgmap_protection_available())
+			return ERR_PTR(-EINVAL);
+		devmap_protection_enable();
+	}
+
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 35/44] memremap_pages: Introduce devmap_protected()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (33 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 34/44] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-01-27 17:54 ` [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM ira.weiny
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Users of protected dev_pagemaps can check the PGMAP_PROTECTION flag to
see if the devmap is protected.  However, most callers operate on struct
pages, not the pagemap directly.

Define devmap_protected() to determine if a page is part of a
dev_pagemap mapping and, if so, whether the page is protected by the
additional protections.
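
The short-circuit order of the predicate matters: the cheap static-key
test comes first so unprotected systems pay nearly nothing.  A user-space
sketch of that logic follows; the `zone_device` field and the boolean
`protection_key_enabled` are stand-ins for `is_zone_device_page()` and
the static branch, not kernel interfaces.

```c
#include <stdbool.h>

#define PGMAP_PROTECTION (1 << 1)

struct dev_pagemap { unsigned long flags; };
struct page {
	bool zone_device;		/* stand-in for is_zone_device_page() */
	struct dev_pagemap *pgmap;
};

/* Stand-in for the static branch; false until a pgmap asks for protection. */
static bool protection_key_enabled;

/* Mirrors devmap_protected(): static key first, then the zone-device
 * test, and only then the per-pagemap flag. */
static bool devmap_protected(struct page *page)
{
	if (!protection_key_enabled)
		return false;
	if (!page->zone_device)
		return false;
	return (page->pgmap->flags & PGMAP_PROTECTION) != 0;
}
```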

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/mm.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ae99bee6e82..6e4a2758e3d3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1145,6 +1145,23 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 
 #ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
 
+DECLARE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+
+/*
+ * devmap_protected() requires a reference on the page to ensure there are no
+ * races with dev_pagemap tear down.
+ */
+static inline bool devmap_protected(struct page *page)
+{
+	if (!static_branch_unlikely(&dev_pgmap_protection_static_key))
+		return false;
+	if (!is_zone_device_page(page))
+		return false;
+	if (page->pgmap->flags & PGMAP_PROTECTION)
+		return true;
+	return false;
+}
+
 bool pgmap_protection_available(void);
 
 #else
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (34 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 35/44] memremap_pages: Introduce devmap_protected() ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-01 18:35   ` Edgecombe, Rick P
  2022-01-27 17:54 ` [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested ira.weiny
                   ` (7 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes.  Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM are more likely to result in permanent
data loss.  Reboot is not a remediation for PMEM corruption as it is for
System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.  PMEM uses the
memmap facility to map its pages into the direct map.

Reserve a PKey for use by the memmap facility.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/pks-keys.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index a3fcd8df8688..46bb9a18da5a 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -42,14 +42,16 @@
  *
  */
 enum pks_pkey_consumers {
-	PKS_KEY_DEFAULT		= 0, /* Must be 0 for default PTE values */
-	PKS_KEY_TEST		= 1,
-	PKS_KEY_NR_CONSUMERS	= 2,
+	PKS_KEY_DEFAULT			= 0, /* Must be 0 for default PTE values */
+	PKS_KEY_TEST			= 1,
+	PKS_KEY_PGMAP_PROTECTION	= 2,
+	PKS_KEY_NR_CONSUMERS		= 3,
 };
 
 #define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
 			PKR_AD_KEY(PKS_KEY_TEST)	| \
-			PKR_AD_KEY(2)	| PKR_AD_KEY(3)		| \
+			PKR_AD_KEY(PKS_KEY_PGMAP_PROTECTION)	| \
+			PKR_AD_KEY(3)	| \
 			PKR_AD_KEY(4)	| PKR_AD_KEY(5)		| \
 			PKR_AD_KEY(6)	| PKR_AD_KEY(7)		| \
 			PKR_AD_KEY(8)	| PKR_AD_KEY(9)		| \
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (35 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-04 17:41   ` Dan Williams
  2022-01-27 17:54 ` [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls ira.weiny
                   ` (6 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

When the user requests protections the dev_pagemap mappings need to have
a PKEY set.

Define devmap_protection_adjust_pgprot() to add the PKey to the page
protections.  Call it when PGMAP_PROTECTION is requested while remapping
pages.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 mm/memremap.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index a74d985a1908..d3e6f328a711 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -83,6 +83,14 @@ static void devmap_protection_enable(void)
 	static_branch_inc(&dev_pgmap_protection_static_key);
 }
 
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	pgprotval_t val;
+
+	val = pgprot_val(prot);
+	return __pgprot(val | _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION));
+}
+
 static void devmap_protection_disable(void)
 {
 	static_branch_dec(&dev_pgmap_protection_static_key);
@@ -99,6 +107,10 @@ EXPORT_SYMBOL_GPL(pgmap_protection_available);
 static void devmap_protection_enable(void) { }
 static void devmap_protection_disable(void) { }
 
+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+	return prot;
+}
 #endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
 
 static void pgmap_array_delete(struct range *range)
@@ -353,6 +365,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		if (!pgmap_protection_available())
 			return ERR_PTR(-EINVAL);
 		devmap_protection_enable();
+		params.pgprot = devmap_protection_adjust_pgprot(params.pgprot);
 	}
 
 	switch (pgmap->type) {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (36 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested ira.weiny
@ 2022-01-27 17:54 ` ira.weiny
  2022-02-04 18:35   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode ira.weiny
                   ` (5 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:54 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Users will need a way to flag valid access to pages which have been
protected with PGMAP protections.  Provide this by defining pgmap_mk_*()
accessor functions.

pgmap_mk_{readwrite|noaccess}() take a struct page for convenience.
They determine if the page is protected by dev_pagemap protections.  If
so, they perform the requested operation.

In addition, the lower level __pgmap_* functions are exported.  They
take the dev_pagemap object directly for internal users who have
knowledge of the dev_pagemap.

All changes in the protections must be through the above calls.  They
abstract the protection implementation (currently the PKS api) from the
upper layer users.

Furthermore, the calls are nestable through the use of a per-task reference
count.  This ensures that the first call to re-enable protection does
not 'break' the last access of the device memory.

Access to device memory during exceptions (#PF) is expected only from
user faults.  Therefore there is no need to maintain the reference count
when entering or exiting exceptions.  However, reference counting will
occur during the exception.  Recall that protection is automatically
enabled during exceptions by the PKS core.[1]

NOTE: It is not anticipated that any code paths will directly nest these
calls.  For this reason multiple reviewers, including Dan and Thomas,
asked why this reference counting was needed at this level rather than
in a higher level call such as kmap_{atomic,local_page}().  The reason
is that pgmap_mk_readwrite() could nest with regards to other callers of
pgmap_mk_*() such as kmap_{atomic,local_page}().  Therefore push this
reference counting to the lower level and just ensure that these calls
are nestable.

[1] https://lore.kernel.org/lkml/20210401225833.566238-9-ira.weiny@intel.com/
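
The nesting semantics described above can be demonstrated with a small
user-space sketch.  The global counter and `pks_readwrite` flag are
stand-ins for `task_struct::pgmap_prot_count` and the PKS key state; only
the first enable and the matching last disable actually touch the key.

```c
/* Per-task nesting count, as in task_struct::pgmap_prot_count. */
static unsigned int pgmap_prot_count;
static int pks_readwrite;		/* 1 while the key permits writes */

/* First caller opens the write window... */
static void pgmap_mk_readwrite(void)
{
	if (!pgmap_prot_count++)
		pks_readwrite = 1;	/* stand-in for pks_mk_readwrite() */
}

/* ...and only the matching outermost caller closes it, so an inner
 * mk_noaccess() cannot 'break' an outer caller's ongoing access. */
static void pgmap_mk_noaccess(void)
{
	if (!--pgmap_prot_count)
		pks_readwrite = 0;	/* stand-in for pks_mk_noaccess() */
}
```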

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split these functions into their own patch.
		This helps to clarify the commit message and usage.
---
 include/linux/mm.h    | 34 ++++++++++++++++++++++++++++++++++
 include/linux/sched.h |  7 +++++++
 init/init_task.c      |  3 +++
 mm/memremap.c         | 14 ++++++++++++++
 4 files changed, 58 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6e4a2758e3d3..60044de77c54 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,10 +1162,44 @@ static inline bool devmap_protected(struct page *page)
 	return false;
 }
 
+void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
+void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
+
+static inline bool pgmap_check_pgmap_prot(struct page *page)
+{
+	if (!devmap_protected(page))
+		return false;
+
+	/*
+	 * There is no known use case to change permissions in an irq for pgmap
+	 * pages
+	 */
+	lockdep_assert_in_irq();
+	return true;
+}
+
+static inline void pgmap_mk_readwrite(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_mk_readwrite(page->pgmap);
+}
+static inline void pgmap_mk_noaccess(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_mk_noaccess(page->pgmap);
+}
+
 bool pgmap_protection_available(void);
 
 #else
 
+static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
+static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
+static inline void pgmap_mk_readwrite(struct page *page) { }
+static inline void pgmap_mk_noaccess(struct page *page) { }
+
 static inline bool pgmap_protection_available(void)
 {
 	return false;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f5b2be39a78c..5020ed7e67b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,13 @@ struct task_struct {
 	struct callback_head		l1d_flush_kill;
 #endif
 
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	/*
+	 * NOTE: pgmap_prot_count is modified within a single thread of
+	 * execution.  So it does not need to be atomic_t.
+	 */
+	u32                             pgmap_prot_count;
+#endif
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 73cc8f03511a..948b32cf8139 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
 #ifdef CONFIG_SECCOMP_FILTER
 	.seccomp	= { .filter_count = ATOMIC_INIT(0) },
 #endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	.pgmap_prot_count = 0,
+#endif
 };
 EXPORT_SYMBOL(init_task);
 
diff --git a/mm/memremap.c b/mm/memremap.c
index d3e6f328a711..b75c4f778c59 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -96,6 +96,20 @@ static void devmap_protection_disable(void)
 	static_branch_dec(&dev_pgmap_protection_static_key);
 }
 
+void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
+{
+	if (!current->pgmap_prot_count++)
+		pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);
+
+void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
+{
+	if (!--current->pgmap_prot_count)
+		pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);
+
 bool pgmap_protection_available(void)
 {
 	return pks_available();
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (37 preceding siblings ...)
  2022-01-27 17:54 ` [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-01  1:16   ` Edgecombe, Rick P
  2022-02-04 19:01   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid() ira.weiny
                   ` (4 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some systems may be using pmem in unanticipated ways.  As such, it is
possible for an unforeseen code path to violate the restrictions of the
PMEM PKS protections.

In order to provide a more seamless integration of the PMEM PKS feature,
provide a pks_fault_mode parameter that allows for a relaxed mode should
a previously working code path fault on the PKS-protected PMEM.

Two modes are available:

	'relaxed' (default) -- WARN_ONCE, remove the protections, and
	continue to operate.

	'strict' -- BUG_ON or fault, indicating the error.  This is the
	most protective of the PMEM memory but may be undesirable in
	some configurations.

NOTE: The typedef of pks_fault_modes is required to allow
param_check_pks_fault() to work automatically for us.  So the typedef
checkpatch warning is ignored.

NOTE: There was some debate about if a 3rd mode called 'silent' should
be available.  'silent' would be the same as 'relaxed' but not print any
output.  While 'silent' is nice for admins to reduce console/log output
it would result in less motivation to fix invalid access to the
protected pmem pages.  Therefore, 'silent' is left out.
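
The accepted parameter strings can be sketched in user-space C.  The
`streq_nl()` helper is a hypothetical stand-in for the kernel's
sysfs_streq(), which tolerates a trailing newline; the enum and setter
mirror the ones added in the diff below but are not the kernel code.

```c
#include <string.h>

typedef enum {
	PKS_MODE_STRICT  = 0,
	PKS_MODE_RELAXED = 1,
} pks_fault_modes;

static pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;

/* Stand-in for sysfs_streq(): equal, optionally ignoring a trailing \n. */
static int streq_nl(const char *a, const char *b)
{
	size_t n = strlen(b);

	return strncmp(a, b, n) == 0 &&
	       (a[n] == '\0' || (a[n] == '\n' && a[n + 1] == '\0'));
}

/* Mirrors param_set_pks_fault_mode(); unknown strings leave the mode
 * untouched and report an error (-EINVAL in the kernel). */
static int param_set_pks_fault_mode(const char *val)
{
	if (streq_nl(val, "relaxed")) {
		pks_fault_mode = PKS_MODE_RELAXED;
		return 0;
	}
	if (streq_nl(val, "strict")) {
		pks_fault_mode = PKS_MODE_STRICT;
		return 0;
	}
	return -1;
}
```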

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Use pks_update_exception() instead of abandoning the pkey.
	Split out pgmap_protection_flag_invalid() into a separate patch
		for clarity.
	From Rick Edgecombe
		Fix sysfs_streq() checks
	From Randy Dunlap
		Fix Documentation closing parens

Changes for V7
	Leverage Rick Edgecombe's fault callback infrastructure to relax invalid
		uses and prevent crashes
	From Dan Williams
		Use sysfs_* calls for parameter
		Make pgmap_disable_protection inline
		Remove pfn from warn output
	Remove silent parameter option
---
 .../admin-guide/kernel-parameters.txt         | 14 ++++
 arch/x86/mm/pkeys.c                           |  4 ++
 include/linux/mm.h                            |  3 +
 mm/memremap.c                                 | 67 +++++++++++++++++++
 4 files changed, 88 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f5a27f067db9..3e70a6194831 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4158,6 +4158,20 @@
 	pirq=		[SMP,APIC] Manual mp-table setup
 			See Documentation/x86/i386/IO-APIC.rst.
 
+	memremap.pks_fault_mode=	[X86] Control the behavior of page map
+			protection violations.  Violations may not be an actual
+			use of the memory but simply an attempt to map it in an
+			incompatible way.
+			(depends on CONFIG_DEVMAP_ACCESS_PROTECTION)
+
+			Format: { relaxed | strict }
+
+			relaxed - Print a warning, disable the protection and
+				  continue execution.
+			strict - Stop kernel execution via BUG_ON or fault
+
+			default: relaxed
+
 	plip=		[PPT,NET] Parallel port network link
 			Format: { parport<nr> | timid | 0 }
 			See also Documentation/admin-guide/parport.rst.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index fa71037c1dd0..e864a9b7828a 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -6,6 +6,7 @@
 #include <linux/debugfs.h>		/* debugfs_create_u32()		*/
 #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
 #include <linux/pkeys.h>                /* PKEY_*                       */
+#include <linux/mm.h>                   /* fault callback               */
 #include <uapi/asm-generic/mman-common.h>
 
 #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
@@ -243,6 +244,9 @@ static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
 #ifdef CONFIG_PKS_TEST
 	[PKS_KEY_TEST]		= pks_test_fault_callback,
 #endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+	[PKS_KEY_PGMAP_PROTECTION]   = pgmap_pks_fault_callback,
+#endif
 };
 
 static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60044de77c54..e900df563437 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1193,6 +1193,9 @@ static inline void pgmap_mk_noaccess(struct page *page)
 
 bool pgmap_protection_available(void);
 
+bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
+			      bool write);
+
 #else
 
 static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
diff --git a/mm/memremap.c b/mm/memremap.c
index b75c4f778c59..783b1cd4bb42 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -96,6 +96,73 @@ static void devmap_protection_disable(void)
 	static_branch_dec(&dev_pgmap_protection_static_key);
 }
 
+/*
+ * Ignore the checkpatch warning because the typedef allows
+ * param_check_pks_fault_modes to automatically check the passed value.
+ */
+typedef enum {
+	PKS_MODE_STRICT  = 0,
+	PKS_MODE_RELAXED = 1,
+} pks_fault_modes;
+
+pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;
+
+static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp)
+{
+	int ret = -EINVAL;
+
+	if (sysfs_streq(val, "relaxed")) {
+		pks_fault_mode = PKS_MODE_RELAXED;
+		ret = 0;
+	} else if (sysfs_streq(val, "strict")) {
+		pks_fault_mode = PKS_MODE_STRICT;
+		ret = 0;
+	}
+
+	return ret;
+}
+
+static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp)
+{
+	int ret = 0;
+
+	switch (pks_fault_mode) {
+	case PKS_MODE_STRICT:
+		ret = sysfs_emit(buffer, "strict\n");
+		break;
+	case PKS_MODE_RELAXED:
+		ret = sysfs_emit(buffer, "relaxed\n");
+		break;
+	default:
+		ret = sysfs_emit(buffer, "<unknown>\n");
+		break;
+	}
+
+	return ret;
+}
+
+static const struct kernel_param_ops param_ops_pks_fault_modes = {
+	.set = param_set_pks_fault_mode,
+	.get = param_get_pks_fault_mode,
+};
+
+#define param_check_pks_fault_modes(name, p) \
+	__param_check(name, p, pks_fault_modes)
+module_param(pks_fault_mode, pks_fault_modes, 0644);
+
+bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
+			      bool write)
+{
+	/* In strict mode just let the fault handler oops */
+	if (pks_fault_mode == PKS_MODE_STRICT)
+		return false;
+
+	WARN_ONCE(1, "Page map protection being disabled");
+	pks_update_exception(regs, PKS_KEY_PGMAP_PROTECTION, 0);
+	return true;
+}
+EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback);
+
 void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
 {
 	if (!current->pgmap_prot_count++)
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (38 preceding siblings ...)
  2022-01-27 17:55 ` [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-01  1:37   ` Edgecombe, Rick P
  2022-02-04 19:18   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages ira.weiny
                   ` (3 subsequent siblings)
  43 siblings, 2 replies; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Some systems may be using pmem in ways that are known to be incompatible
with the PKS implementation.  One such example is the use of kmap() to
create 'global' mappings.

Rather than only reporting the invalid access on fault, provide a call
to flag those uses immediately.  This allows for a much better splat for
debugging to occur.

This is also nice because even if no invalid access actually occurs,
the invalid mapping can be fixed with kmap_local_page() rather than
having to look for a different solution.

Define pgmap_protection_flag_invalid() and have it follow the policy set
by pks_fault_mode.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Split this from the fault mode patch
---
 include/linux/mm.h | 23 +++++++++++++++++++++++
 mm/memremap.c      |  9 +++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e900df563437..3c0aa686b5bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,7 @@ static inline bool devmap_protected(struct page *page)
 	return false;
 }
 
+void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap);
 void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
 void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
 
@@ -1178,6 +1179,27 @@ static inline bool pgmap_check_pgmap_prot(struct page *page)
 	return true;
 }
 
+/*
+ * pgmap_protection_flag_invalid - Check and flag an invalid use of a pgmap
+ *                                 protected page
+ *
+ * There are code paths which are known to not be compatible with pgmap
+ * protections.  pgmap_protection_flag_invalid() is provided as a 'relief
+ * valve' to be used in those functions which are known to be incompatible.
+ *
+ * Thus an invalid use case can be flagged with more precise data rather than
+ * just flagging a fault.  Like the fault handler code this abandons the use of
+ * the PKS key and optionally allows the calling code path to continue based on
+ * the configuration of the memremap.pks_fault_mode command line
+ * (and/or sysfs) option.
+ */
+static inline void pgmap_protection_flag_invalid(struct page *page)
+{
+	if (!pgmap_check_pgmap_prot(page))
+		return;
+	__pgmap_protection_flag_invalid(page->pgmap);
+}
+
 static inline void pgmap_mk_readwrite(struct page *page)
 {
 	if (!pgmap_check_pgmap_prot(page))
@@ -1200,6 +1222,7 @@ bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
 
 static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
 static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
+static inline void pgmap_protection_flag_invalid(struct page *page) { }
 static inline void pgmap_mk_readwrite(struct page *page) { }
 static inline void pgmap_mk_noaccess(struct page *page) { }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 783b1cd4bb42..fd4b9b83b770 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -150,6 +150,15 @@ static const struct kernel_param_ops param_ops_pks_fault_modes = {
 	__param_check(name, p, pks_fault_modes)
 module_param(pks_fault_mode, pks_fault_modes, 0644);
 
+void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap)
+{
+	if (pks_fault_mode == PKS_MODE_STRICT)
+		return;
+
+	WARN_ONCE(1, "Invalid page map use");
+}
+EXPORT_SYMBOL_GPL(__pgmap_protection_flag_invalid);
+
 bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
 			      bool write)
 {
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (39 preceding siblings ...)
  2022-01-27 17:55 ` [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid() ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-04 21:07   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 42/44] dax: Stray access protection for dax_direct_access() ira.weiny
                   ` (2 subsequent siblings)
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Users of devmap pages should not have to know that the pages they are
operating on are special.

Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
pages via the devmap facility.  kmap_{local_page,atomic}() are both
thread local mappings so they work well with the thread specific
protections available.

kmap(), on the other hand, allows for global mappings to be established,
which is incompatible with the underlying PKS facility.  For this reason
kmap() is not supported.  Rather than leave the kmap mappings to fault
at random times when users may access them, call
pgmap_protection_flag_invalid() to show kmap() users the call stack of
where mapping was created.  This allows better debugging.

This behavior is safe because neither of the 2 current DAX-capable
filesystems (ext4 and xfs) perform such global mappings.  And known
device drivers that would handle devmap pages are not using kmap().  Any
future filesystems that gain DAX support, or device drivers wanting to
support devmap protected pages will need to use kmap_local_page().

Direct-map exposure is already mitigated by default on HIGHMEM systems
because by definition HIGHMEM systems do not have large capacities of
memory in the direct map.  And using kmap in those systems actually
creates a separate mapping.  Therefore, to reduce complexity, HIGHMEM
systems are not supported.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Reword commit message
---
 include/linux/highmem-internal.h | 5 +++++
 mm/Kconfig                       | 1 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
index 0a0b2b09b1b8..1a006558734c 100644
--- a/include/linux/highmem-internal.h
+++ b/include/linux/highmem-internal.h
@@ -159,6 +159,7 @@ static inline struct page *kmap_to_page(void *addr)
 static inline void *kmap(struct page *page)
 {
 	might_sleep();
+	pgmap_protection_flag_invalid(page);
 	return page_address(page);
 }
 
@@ -174,6 +175,7 @@ static inline void kunmap(struct page *page)
 
 static inline void *kmap_local_page(struct page *page)
 {
+	pgmap_mk_readwrite(page);
 	return page_address(page);
 }
 
@@ -197,6 +199,7 @@ static inline void __kunmap_local(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_mk_noaccess(kmap_to_page(addr));
 }
 
 static inline void *kmap_atomic(struct page *page)
@@ -206,6 +209,7 @@ static inline void *kmap_atomic(struct page *page)
 	else
 		preempt_disable();
 	pagefault_disable();
+	pgmap_mk_readwrite(page);
 	return page_address(page);
 }
 
@@ -224,6 +228,7 @@ static inline void __kunmap_atomic(void *addr)
 #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
 	kunmap_flush_on_unmap(addr);
 #endif
+	pgmap_mk_noaccess(kmap_to_page(addr));
 	pagefault_enable();
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		migrate_enable();
diff --git a/mm/Kconfig b/mm/Kconfig
index 67e0264acf7d..d537679448ae 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -779,6 +779,7 @@ config ZONE_DEVICE
 config DEVMAP_ACCESS_PROTECTION
 	bool "Access protection for memremap_pages()"
 	depends on NVDIMM_PFN
+	depends on !HIGHMEM
 	depends on ARCH_HAS_SUPERVISOR_PKEYS
 	select ARCH_ENABLE_SUPERVISOR_PKEYS
 	default y
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 42/44] dax: Stray access protection for dax_direct_access()
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (40 preceding siblings ...)
  2022-01-27 17:55 ` [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-04  5:19   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection ira.weiny
  2022-01-27 17:55 ` [PATCH V8 44/44] devdax: " ira.weiny
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

dax_direct_access() provides a way to obtain the direct map address of
PMEM memory.  Coordinate PKS protection with dax_direct_access() of
protected devmap pages.

Introduce 3 new dax_operations calls: .map_protected, .mk_readwrite, and
.mk_noaccess.  These 3 calls do not have to be implemented by the dax
provider if no protection is implemented.

Threads of execution can use dax_mk_{readwrite,noaccess}() to relax the
protection of the dax device and allow direct use of the kaddr returned
from dax_direct_access().  The dax_mk_{readwrite,noaccess}() calls only
need to be used to guard actual access to the memory.  Other uses of
dax_direct_access() do not need to use these guards.

For users who require a permanent address to the dax device, such as the
DM write cache, dax_map_protected() indicates that the dax device has
additional protections and that the user should create its own permanent
mapping of the memory.  Update the DM write cache code to create this
permanent mapping.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Rebase changes on 5.17-rc1
	Clean up the cover letter
		dax_read_lock() is not required
		s/dax_protected()/dax_map_protected()/
	Testing revealed a dax_flush() which was not properly protected.

Changes for V7
	Rework cover letter.
	Do not include a FS_DAX_LIMITED restriction for dcss.  It  will
		simply not implement the protection and there is no need
		to special case this.
		Clean up commit message because I did not originally
		understand the nuance of the s390 device.
	Introduce dax_{protected,mk_readwrite,mk_noaccess}()
	From Dan Williams
		Remove old clean up cruft from previous versions
		Remove map_protected
	Remove 'global' parameters from all calls
---
 drivers/dax/super.c        | 54 ++++++++++++++++++++++++++++++++++++++
 drivers/md/dm-writecache.c |  8 +++++-
 fs/dax.c                   |  8 ++++++
 fs/fuse/virtio_fs.c        |  2 ++
 include/linux/dax.h        |  8 ++++++
 5 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..705b2e736200 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -117,6 +117,8 @@ enum dax_device_flags {
  * @pgoff: offset in pages from the start of the device to translate
  * @nr_pages: number of consecutive pages caller can handle relative to @pfn
  * @kaddr: output parameter that returns a virtual address mapping of pfn
+ *         Direct access through this pointer must be guarded by calls to
+ *         dax_mk_{readwrite,noaccess}()
  * @pfn: output parameter that returns an absolute pfn translation of @pgoff
  *
  * Return: negative errno if an error occurs, otherwise the number of
@@ -209,6 +211,58 @@ void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
 #endif
 EXPORT_SYMBOL_GPL(dax_flush);
 
+bool dax_map_protected(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return false;
+
+	if (dax_dev->ops->map_protected)
+		return dax_dev->ops->map_protected(dax_dev);
+	return false;
+}
+EXPORT_SYMBOL_GPL(dax_map_protected);
+
+/**
+ * dax_mk_readwrite() - make protected dax devices read/write
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * Any access of the kaddr memory returned from dax_direct_access() must be
+ * guarded by dax_mk_readwrite() and dax_mk_noaccess().  This ensures that any
+ * dax devices which have additional protections are allowed to relax those
+ * protections for the thread using this memory.
+ *
+ * NOTE: these calls must be contained within a single thread of execution and
+ * both must be guarded by dax_read_lock(), which is also a requirement for
+ * dax_direct_access() anyway.
+ */
+void dax_mk_readwrite(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return;
+
+	if (dax_dev->ops->mk_readwrite)
+		dax_dev->ops->mk_readwrite(dax_dev);
+}
+EXPORT_SYMBOL_GPL(dax_mk_readwrite);
+
+/**
+ * dax_mk_noaccess() - restore protection to dax devices if needed
+ * @dax_dev: the dax device representing the memory to access
+ *
+ * See dax_direct_access() and dax_mk_readwrite()
+ *
+ * NOTE: must be called prior to dax_read_unlock()
+ */
+void dax_mk_noaccess(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return;
+
+	if (dax_dev->ops->mk_noaccess)
+		dax_dev->ops->mk_noaccess(dax_dev);
+}
+EXPORT_SYMBOL_GPL(dax_mk_noaccess);
+
 void dax_write_cache(struct dax_device *dax_dev, bool wc)
 {
 	if (wc)
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 4f31591d2d25..5d6d7b6bad30 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -297,7 +297,13 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		r = -EOPNOTSUPP;
 		goto err2;
 	}
-	if (da != p) {
+
+	/*
+	 * Force the write cache to map the pages directly if the dax device
+	 * mapping is protected or if the number of pages returned was not what
+	 * was requested.
+	 */
+	if (dax_map_protected(wc->ssd_dev->dax_dev) || da != p) {
 		long i;
 		wc->memory_map = NULL;
 		pages = kvmalloc_array(p, sizeof(struct page *), GFP_KERNEL);
diff --git a/fs/dax.c b/fs/dax.c
index cd03485867a7..0b22a1091fe2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -728,7 +728,9 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 		return rc;
 	}
 	vto = kmap_atomic(vmf->cow_page);
+	dax_mk_readwrite(iter->iomap.dax_dev);
 	copy_user_page(vto, kaddr, vmf->address, vmf->cow_page);
+	dax_mk_noaccess(iter->iomap.dax_dev);
 	kunmap_atomic(vto);
 	dax_read_unlock(id);
 	return 0;
@@ -937,8 +939,10 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	count = 1UL << dax_entry_order(entry);
 	index = xas->xa_index & ~(count - 1);
 
+	dax_mk_readwrite(dax_dev);
 	dax_entry_mkclean(mapping, index, pfn);
 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+	dax_mk_noaccess(dax_dev);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -1125,8 +1129,10 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
 
 	ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
 	if (ret > 0) {
+		dax_mk_readwrite(dax_dev);
 		memset(kaddr + offset, 0, size);
 		dax_flush(dax_dev, kaddr + offset, size);
+		dax_mk_noaccess(dax_dev);
 	}
 	return ret;
 }
@@ -1260,12 +1266,14 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		if (map_len > end - pos)
 			map_len = end - pos;
 
+		dax_mk_readwrite(dax_dev);
 		if (iov_iter_rw(iter) == WRITE)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
 			xfer = dax_copy_to_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
+		dax_mk_noaccess(dax_dev);
 
 		pos += xfer;
 		length -= xfer;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c..c748218fe70c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -774,8 +774,10 @@ static int virtio_fs_zero_page_range(struct dax_device *dax_dev,
 	rc = dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, NULL);
 	if (rc < 0)
 		return rc;
+	dax_mk_readwrite(dax_dev);
 	memset(kaddr, 0, nr_pages << PAGE_SHIFT);
 	dax_flush(dax_dev, kaddr, nr_pages << PAGE_SHIFT);
+	dax_mk_noaccess(dax_dev);
 	return 0;
 }
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..261af298f89f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -30,6 +30,10 @@ struct dax_operations {
 			sector_t, sector_t);
 	/* zero_page_range: required operation. Zero page range   */
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
+
+	bool (*map_protected)(struct dax_device *dax_dev);
+	void (*mk_readwrite)(struct dax_device *dax_dev);
+	void (*mk_noaccess)(struct dax_device *dax_dev);
 };
 
 #if IS_ENABLED(CONFIG_DAX)
@@ -187,6 +191,10 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
+bool dax_map_protected(struct dax_device *dax_dev);
+void dax_mk_readwrite(struct dax_device *dax_dev);
+void dax_mk_noaccess(struct dax_device *dax_dev);
+
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops);
 vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (41 preceding siblings ...)
  2022-01-27 17:55 ` [PATCH V8 42/44] dax: Stray access protection for dax_direct_access() ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-04 21:10   ` Dan Williams
  2022-01-27 17:55 ` [PATCH V8 44/44] devdax: " ira.weiny
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Now that all valid kernel accesses to PMEM have been annotated with
{__}pgmap_mk_{readwrite,noaccess}(), PGMAP_PROTECTION is safe to enable
in the pmem layer.

Implement pmem_map_protected() and pmem_mk_{readwrite,noaccess}() to
communicate to the upper layers that this memory has extra protection
when PGMAP_PROTECTION is specified.

Internally, the pmem driver uses a cached virtual address,
pmem->virt_addr (pmem_addr).  Use __pgmap_mk_{readwrite,noaccess}()
directly when PGMAP_PROTECTION is active on the device.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Rebase to 5.17-rc1
	Remove global param
	Add internal structure which uses the pmem device and pgmap
		device directly in the *_mk_*() calls.
	Add pmem dax ops callbacks
	Use pgmap_protection_available()
	s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
---
 drivers/nvdimm/pmem.c | 52 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..2afff8157233 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
 	return BLK_STS_OK;
 }
 
+static void __pmem_mk_readwrite(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_mk_readwrite(&pmem->pgmap);
+}
+
+static void __pmem_mk_noaccess(struct pmem_device *pmem)
+{
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		__pgmap_mk_noaccess(&pmem->pgmap);
+}
+
 static blk_status_t pmem_do_read(struct pmem_device *pmem,
 			struct page *page, unsigned int page_off,
 			sector_t sector, unsigned int len)
@@ -149,7 +161,10 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
 	if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
 		return BLK_STS_IOERR;
 
+	__pmem_mk_readwrite(pmem);
 	rc = read_pmem(page, page_off, pmem_addr, len);
+	__pmem_mk_noaccess(pmem);
+
 	flush_dcache_page(page);
 	return rc;
 }
@@ -181,11 +196,14 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
 	 * after clear poison.
 	 */
 	flush_dcache_page(page);
+
+	__pmem_mk_readwrite(pmem);
 	write_pmem(pmem_addr, page, page_off, len);
 	if (unlikely(bad_pmem)) {
 		rc = pmem_clear_poison(pmem, pmem_off, len);
 		write_pmem(pmem_addr, page, page_off, len);
 	}
+	__pmem_mk_noaccess(pmem);
 
 	return rc;
 }
@@ -301,11 +319,36 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
 	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
 }
 
+static bool pmem_map_protected(struct dax_device *dax_dev)
+{
+	struct pmem_device *pmem = dax_get_private(dax_dev);
+
+	return (pmem->pgmap.flags & PGMAP_PROTECTION);
+}
+
+static void pmem_mk_readwrite(struct dax_device *dax_dev)
+{
+	__pmem_mk_readwrite(dax_get_private(dax_dev));
+}
+
+static void pmem_mk_noaccess(struct dax_device *dax_dev)
+{
+	__pmem_mk_noaccess(dax_get_private(dax_dev));
+}
+
 static const struct dax_operations pmem_dax_ops = {
 	.direct_access = pmem_dax_direct_access,
 	.zero_page_range = pmem_dax_zero_page_range,
 };
 
+static const struct dax_operations pmem_protected_dax_ops = {
+	.direct_access = pmem_dax_direct_access,
+	.zero_page_range = pmem_dax_zero_page_range,
+	.map_protected = pmem_map_protected,
+	.mk_readwrite = pmem_mk_readwrite,
+	.mk_noaccess = pmem_mk_noaccess,
+};
+
 static ssize_t write_cache_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -427,6 +470,8 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		if (pgmap_protection_available())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +485,8 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		if (pgmap_protection_available())
+			pmem->pgmap.flags |= PGMAP_PROTECTION;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
 		bb_range = pmem->pgmap.range;
@@ -474,7 +521,10 @@ static int pmem_attach_disk(struct device *dev,
 	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
 	disk->bb = &pmem->bb;
 
-	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
+	if (pmem->pgmap.flags & PGMAP_PROTECTION)
+		dax_dev = alloc_dax(pmem, &pmem_protected_dax_ops);
+	else
+		dax_dev = alloc_dax(pmem, &pmem_dax_ops);
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
 		goto out;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* [PATCH V8 44/44] devdax: Enable stray access protection
  2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
                   ` (42 preceding siblings ...)
  2022-01-27 17:55 ` [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection ira.weiny
@ 2022-01-27 17:55 ` ira.weiny
  2022-02-04 21:12   ` Dan Williams
  43 siblings, 1 reply; 145+ messages in thread
From: ira.weiny @ 2022-01-27 17:55 UTC (permalink / raw)
  To: Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Ira Weiny, Fenghua Yu, Rick Edgecombe, linux-kernel

From: Ira Weiny <ira.weiny@intel.com>

Device dax is primarily accessed through user space and kernel access is
controlled through the kmap interfaces.

Now that all valid kernel-initiated accesses to dax devices have been
accounted for, turn on PGMAP_PROTECTION for device dax.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V8
	Rebase to 5.17-rc1
	Use pgmap_protection_available()
	s/PGMAP_PKEYS_PROTECT/PGMAP_PROTECTION/
---
 drivers/dax/device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d33a0613ed0c..cee375ef2cac 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -452,6 +452,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	if (dev_dax->align > PAGE_SIZE)
 		pgmap->vmemmap_shift =
 			order_base_2(dev_dax->align >> PAGE_SHIFT);
+	if (pgmap_protection_available())
+		pgmap->flags |= PGMAP_PROTECTION;
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys
  2022-01-27 17:54 ` [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys ira.weiny
@ 2022-01-28 22:39   ` Dave Hansen
  2022-02-01 23:49     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 22:39 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
> +Processor" Server CPUs and later.  And it will be available in future
> +non-server Intel parts and future AMD processors.

The non-server parts are quite available these days.  I'm typing on one
right now:

	model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

This can probably say:

Protection Keys for Userspace (PKU) can be found on:
 * Intel server CPUs, Skylake and later
 * Intel client CPUs, Tiger Lake (11th Gen Core) and later
 * Future AMD CPUs

It would be great if the AMD folks can elaborate on that a bit, but I
understand it might not be possible if the CPUs aren't out yet.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h
  2022-01-27 17:54 ` [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h ira.weiny
@ 2022-01-28 22:43   ` Dave Hansen
  2022-02-02  1:00     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 22:43 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
> similar fashions and can share common defines.  Specifically PKS and PKU
> each have:
> 
> 	1. A single control register
> 	2. The same number of keys
> 	3. The same number of bits in the register per key
> 	4. Access and Write disable in the same bit locations
> 
> Given the above, share all the macros that synthesize and manipulate
> register values between the two features.  Share these defines by moving
> them into a new header, change their names to reflect the common use,
> and include the header where needed.

I'd probably include *one* more sentence to prime the reader for the
pattern they are about to see.  Perhaps:

	This mostly takes the form of converting names from the PKU-
	specific "PKRU" to the U/S-agnostic "PKR".

> Also while editing the code remove the use of 'we' from comments being
> touched.
> 
> NOTE the checkpatch errors are ignored for the init_pkru_value to
> align the values in the code.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Either way, this looks fine:

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros
  2022-01-27 17:54 ` [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros ira.weiny
@ 2022-01-28 22:47   ` Dave Hansen
  2022-02-02 20:21     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 22:47 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
> +#define PKR_WD_KEY(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))

I don't _hate_ this, but naming here is wonky for me.  PKR_WD_KEY reads
to me as "pkey register write-disable key", as in, please write-disable
this key, or maybe "make a write-disable key".

It's generating a mask, so I'd probably name it:

#define PKR_WD_MASK(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))

Which says, "generate a write-disabled mask for this pkey".

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access()
  2022-01-27 17:54 ` [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
@ 2022-01-28 22:50   ` Dave Hansen
  2022-02-02 20:22     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 22:50 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Both PKU and PKS update their register values in the same way.  They can
> therefore share the update code.
> 
> Define a helper, pkey_update_pkval(), which will be used to support both
> Protection Key User (PKU) and the new Protection Key for Supervisor
> (PKS) in subsequent patches.
> 
> pkey_update_pkval() contributed by Thomas
> 
> Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Looks better than my original code.  Waaaaaay simpler.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-27 17:54 ` [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS ira.weiny
@ 2022-01-28 22:54   ` Dave Hansen
  2022-01-28 23:10     ` Ira Weiny
  2022-01-29  0:06   ` Dave Hansen
  1 sibling, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 22:54 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Protection Key Supervisor, PKS, is a feature used by kernel code only.
> As such if no kernel users are configured the PKS code is unnecessary
> overhead.
> 
> Define a Kconfig structure which allows kernel code to detect PKS
> support by an architecture and then subsequently enable that support
> within the architecture.
> 
> ARCH_HAS_SUPERVISOR_PKEYS indicates to kernel consumers that an
> architecture supports pkeys.  PKS users can then select
> ARCH_ENABLE_SUPERVISOR_PKEYS to turn on the support within the
> architecture.
> 
> If ARCH_ENABLE_SUPERVISOR_PKEYS is not selected architectures avoid the
> PKS overhead.
> 
> ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first kernel use case
> sets it.

This is heavy on the "what" and weak on the "why".

Why isn't this an x86-specific Kconfig?  Why do we need two Kconfigs?
Good old user pkeys only has one:

	config ARCH_HAS_PKEYS
	        bool

and it's in arch-generic code because there are ppc and x86
implementations *and* the pkey support touches generic code.

This might become evident later in the series, but it's clear as mud as
it stands.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit
  2022-01-27 17:54 ` [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit ira.weiny
@ 2022-01-28 23:05   ` Dave Hansen
  2022-02-04 19:21     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:05 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
> specific, manipulation of permission restrictions on supervisor page

Nit: should be "hardware-thread-specific".

> mappings.  It uses the same mechanism of Protection Keys as those on
> User mappings but applies that mechanism to supervisor mappings using a
> supervisor specific MSR.

"supervisor-specific"

	Memory Protection Keys (pkeys) provides a mechanism for
	enforcing page-based protections, but without requiring
	modification of the page tables when an application changes
	protection domains.

	The kernel currently supports the pkeys for userspace (PKU)
	architecture.  That architecture has been extended to
	additionally support supervisor mappings.  The supervisor
	support is referred to as PKS.

I probably wouldn't mention the MSR unless you want to say:

	The main difference between PKU and PKS is that PKS does not
	introduce any new instructions to write to its register.  The
	register is exposed as a normal MSR and is accessed with the
	normal MSR instructions.


> The CPU indicates support for PKS in bit 31 of the ECX register after a
> cpuid instruction.

I'd just remove this sentence.  We don't need to rehash each tiny morsel
of the architecture in a commit message.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-28 22:54   ` Dave Hansen
@ 2022-01-28 23:10     ` Ira Weiny
  2022-01-28 23:51       ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-01-28 23:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 02:54:26PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Protection Key Supervisor, PKS, is a feature used by kernel code only.
> > As such if no kernel users are configured the PKS code is unnecessary
> > overhead.

Indeed this was a bit weak sorry.  See below.

> > 
> > Define a Kconfig structure which allows kernel code to detect PKS
> > support by an architecture and then subsequently enable that support
> > within the architecture.
> > 
> > ARCH_HAS_SUPERVISOR_PKEYS indicates to kernel consumers that an
> > architecture supports pkeys.  PKS users can then select
> > ARCH_ENABLE_SUPERVISOR_PKEYS to turn on the support within the
> > architecture.
> > 
> > If ARCH_ENABLE_SUPERVISOR_PKEYS is not selected architectures avoid the
> > PKS overhead.
> > 
> > ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first kernel use case
> > sets it.
> 
> This is heavy on the "what" and weak on the "why".
> 
> Why isn't this an x86-specific Kconfig?  Why do we need two Kconfigs?
> Good old user pkeys only has one:
> 
> 	config ARCH_HAS_PKEYS
> 	        bool
> 
> and it's in arch-generic code because there are ppc and x86
> implementations *and* the pkey support touches generic code.
> 
> This might become evident later in the series, but it's clear as mud as
> it stands.

Sorry, I'll expand on this.

The issue is that, because PKS users are kernel-only and are not part of the
architecture-specific code, there need to be two mechanisms within the Kconfig
structure: one to communicate that an architecture supports PKS, so that the
user who needs it can depend on that config, and a second to allow that user
to communicate back to the architecture to enable PKS.

Ira
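The two-mechanism structure described above can be sketched in Kconfig.  This
is an illustration of the shape of the dependency chain, not the exact text of
the patch; the option names come from this series and the consumer shown is
the DEVMAP_ACCESS_PROTECTION option from patch 41/44:

```kconfig
# Set by the architecture (e.g. arch/x86/Kconfig) when the hardware and
# arch code can support supervisor pkeys.
config ARCH_HAS_SUPERVISOR_PKEYS
	bool

# Selected by the first in-kernel user; until some user selects it, the
# architecture avoids the PKS overhead entirely.
config ARCH_ENABLE_SUPERVISOR_PKEYS
	bool

# Example consumer: depends on the "arch supports it" option and selects
# the "turn it on" option.
config DEVMAP_ACCESS_PROTECTION
	bool "Access protection for memremap_pages()"
	depends on ARCH_HAS_SUPERVISOR_PKEYS
	select ARCH_ENABLE_SUPERVISOR_PKEYS
```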

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault
  2022-01-27 17:54 ` [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
@ 2022-01-28 23:10   ` Dave Hansen
  2022-02-04 20:06     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:10 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Previously if a Protection key fault occurred it indicated something
> very wrong because user page mappings are not supposed to be in the
> kernel address space.

This is missing a key point.  The problem is PK faults on *kernel*
addresses.

> Now PKey faults may happen on kernel mappings if the feature is enabled.

One nit: I've been using "pkeys" and "pkey" as the terms.  I usually
don't capitalize them except at the beginning of a sentence.

> If PKS is enabled, avoid the warning in the fault path.
> 
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  arch/x86/mm/fault.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..6ed91b632eac 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1148,11 +1148,15 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>  		   unsigned long address)
>  {
>  	/*
> -	 * Protection keys exceptions only happen on user pages.  We
> -	 * have no user pages in the kernel portion of the address
> -	 * space, so do not expect them here.
> +	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
> +	 * when PKS (PKeys Supervisor) is enabled.
> +	 *
> +	 * However, if PKS is not enabled WARN if this exception is seen
> +	 * because there are no user pages in the kernel portion of the address
> +	 * space.
>  	 */
> -	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> +	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
> +		     (hw_error_code & X86_PF_PK));
>  
>  #ifdef CONFIG_X86_32
>  	/*

I'm wondering if this warning is even doing us any good.  I'm pretty
sure it's never triggered on me at least.  Either way, let's not get too
carried away with the comment.  I think this should do:

	/*
	 * X86_PF_PK faults should only occur on kernel
	 * addresses when supervisor pkeys are enabled.
	 */

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it
  2022-01-27 17:54 ` [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it ira.weiny
@ 2022-01-28 23:18   ` Dave Hansen
  2022-01-28 23:41     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:18 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
> specific, manipulation of permission restrictions on supervisor page
> mappings.  It uses the same mechanism of Protection Keys as those on
> User mappings but applies that mechanism to supervisor mappings using a
> supervisor specific MSR.
> 
> Bit 24 of CR4 is used to enable the feature by software.  Define
> pks_setup() to be called when PKS is configured.

Again, no need to specify the bit numbers.  We have it in the code. :)
At most, just say something like:

	PKS is enabled by a new bit in a control register.
or
	PKS is enabled by a new bit in CR4.

> Initially, pks_setup() initializes the per-cpu MSR with 0 to enable all
> access on all pkeys.

Why not just make it restrictive to start out?  That's what we do for PKU.

> asm/pks.h is added as a new file to store new
> internal functions and structures such as pks_setup().

One writing nit: try to speak in active voice.

Passive: "Foo is added"
Active: "Add foo"

It actually makes things shorter and easier to read:

	Add asm/pks.h to store new internal functions and structures
	such as pks_setup().

> diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
> index bcba3c643e63..191c574b2390 100644
> --- a/arch/x86/include/uapi/asm/processor-flags.h
> +++ b/arch/x86/include/uapi/asm/processor-flags.h
> @@ -130,6 +130,8 @@
>  #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
>  #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
>  #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
> +#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
> +#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
>  
>  /*
>   * x86-64 Task Priority Register, CR8
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 7b8382c11788..83c1abce7d93 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -59,6 +59,7 @@
>  #include <asm/cpu_device_id.h>
>  #include <asm/uv/uv.h>
>  #include <asm/sigframe.h>
> +#include <asm/pks.h>
>  
>  #include "cpu.h"
>  
> @@ -1632,6 +1633,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  
>  	x86_init_rdrand(c);
>  	setup_pku(c);
> +	pks_setup();
>  
>  	/*
>  	 * Clear/Set all flags overridden by options, need do it
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index cf12d8bf122b..02629219e683 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -206,3 +206,19 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
>  	pkval &= ~(PKEY_ACCESS_MASK << shift);
>  	return pkval | accessbits << shift;
>  }
> +
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +/*
> + * PKS is independent of PKU and either or both may be supported on a CPU.
> + */
> +void pks_setup(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	wrmsrl(MSR_IA32_PKRS, 0);

This probably needs a one-line comment about what it's doing.  As a
general rule, I'd much rather have a one-sentence note in a code comment
than in the changelog.

> +	cr4_set_bits(X86_CR4_PKS);
> +}
> +
> +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it
  2022-01-28 23:18   ` Dave Hansen
@ 2022-01-28 23:41     ` Ira Weiny
  2022-01-28 23:53       ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-01-28 23:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 03:18:29PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
> > specific, manipulation of permission restrictions on supervisor page
> > mappings.  It uses the same mechanism of Protection Keys as those on
> > User mappings but applies that mechanism to supervisor mappings using a
> > supervisor specific MSR.
> > 
> > Bit 24 of CR4 is used to enable the feature by software.  Define
> > pks_setup() to be called when PKS is configured.
> 
> Again, no need to specify the bit numbers.  We have it in the code. :)
> At most, just say something like:
> 
> 	PKS is enabled by a new bit in a control register.
> or
> 	PKS is enabled by a new bit in CR4.
> 
> > Initially, pks_setup() initializes the per-cpu MSR with 0 to enable all
> > access on all pkeys.
> 
> Why not just make it restrictive to start out?  That's what we do for PKU.

This maintains compatibility with the code prior to this patch.  I.e., no
restrictions on kernel mappings.

I'll place the default value patch before this one and use it in this patch.

> 
> > asm/pks.h is added as a new file to store new
> > internal functions and structures such as pks_setup().
> 
> One writing nit: try to speak in active voice.
> 
> Passive: "Foo is added"
> Active: "Add foo"
> 
It actually makes things shorter and easier to read:
> 
> 	Add asm/pks.h to store new internal functions and structures
> 	such as pks_setup().

Ok.  I'll update the commit message.

> 
> > diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
> > index bcba3c643e63..191c574b2390 100644
> > --- a/arch/x86/include/uapi/asm/processor-flags.h
> > +++ b/arch/x86/include/uapi/asm/processor-flags.h
> > @@ -130,6 +130,8 @@
> >  #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
> >  #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
> >  #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
> > +#define X86_CR4_PKS_BIT		24 /* enable Protection Keys for Supervisor */
> > +#define X86_CR4_PKS		_BITUL(X86_CR4_PKS_BIT)
> >  
> >  /*
> >   * x86-64 Task Priority Register, CR8
> > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > index 7b8382c11788..83c1abce7d93 100644
> > --- a/arch/x86/kernel/cpu/common.c
> > +++ b/arch/x86/kernel/cpu/common.c
> > @@ -59,6 +59,7 @@
> >  #include <asm/cpu_device_id.h>
> >  #include <asm/uv/uv.h>
> >  #include <asm/sigframe.h>
> > +#include <asm/pks.h>
> >  
> >  #include "cpu.h"
> >  
> > @@ -1632,6 +1633,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  
> >  	x86_init_rdrand(c);
> >  	setup_pku(c);
> > +	pks_setup();
> >  
> >  	/*
> >  	 * Clear/Set all flags overridden by options, need do it
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index cf12d8bf122b..02629219e683 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -206,3 +206,19 @@ u32 pkey_update_pkval(u32 pkval, int pkey, u32 accessbits)
> >  	pkval &= ~(PKEY_ACCESS_MASK << shift);
> >  	return pkval | accessbits << shift;
> >  }
> > +
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +
> > +/*
> > + * PKS is independent of PKU and either or both may be supported on a CPU.
> > + */
> > +void pks_setup(void)
> > +{
> > +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> > +		return;
> > +
> > +	wrmsrl(MSR_IA32_PKRS, 0);
> 
> This probably needs a one-line comment about what it's doing.  As a
> general rule, I'd much rather have a one-sentence note in a code comment
> than in the changelog.

Fair enough,
Ira

> 
> > +	cr4_set_bits(X86_CR4_PKS);
> > +}
> > +
> > +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-28 23:10     ` Ira Weiny
@ 2022-01-28 23:51       ` Dave Hansen
  2022-02-04 19:08         ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:51 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 1/28/22 15:10, Ira Weiny wrote:
> The issue is that, because PKS users are kernel-only and are not part of the
> architecture-specific code, there need to be 2 mechanisms within the Kconfig
> structure.  One to communicate an architecture's support for PKS such that the
> user who needs it can depend on that config, as well as a second to allow that
> user to communicate back to the architecture to enable PKS.

I *think* the point here is to ensure that PKS isn't compiled in unless
it is supported *AND* needed.  You have to have architecture support
(ARCH_HAS_SUPERVISOR_PKEYS) to permit features that depend on PKS to be
enabled.  Then, once one or more of *THOSE* is enabled,
ARCH_ENABLE_SUPERVISOR_PKEYS comes into play and actually compiles the
feature in.

In other words, there are two things that must happen before the code
gets compiled in:

1. Arch support
2. One or more features to use the arch support


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it
  2022-01-28 23:41     ` Ira Weiny
@ 2022-01-28 23:53       ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:53 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 1/28/22 15:41, Ira Weiny wrote:
> On Fri, Jan 28, 2022 at 03:18:29PM -0800, Dave Hansen wrote:
>> On 1/27/22 09:54, ira.weiny@intel.com wrote:
>>> Initially, pks_setup() initializes the per-cpu MSR with 0 to enable all
>>> access on all pkeys.
>>
>> Why not just make it restrictive to start out?  That's what we do for PKU.
> 
> This maintains compatibility with the code prior to this patch.  Ie no
> restrictions on kernel mappings.

But, compatibility with what?  At this point, there are no non-pkey-0
kernel mappings.  So, PKRS can be set to anything as long as the two low
bits are clear.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation
  2022-01-27 17:54 ` [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation ira.weiny
@ 2022-01-28 23:57   ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2022-01-28 23:57 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Add initial overview and configuration information about PKS.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  Documentation/core-api/protection-keys.rst | 57 ++++++++++++++++++++--
>  1 file changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
> index 12331db474aa..58670e3ee39e 100644
> --- a/Documentation/core-api/protection-keys.rst
> +++ b/Documentation/core-api/protection-keys.rst
> @@ -12,6 +12,9 @@ PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
>  Processor" Server CPUs and later.  And it will be available in future
>  non-server Intel parts and future AMD processors.
>  
> +Protection Keys for Supervisor pages (PKS) is available in the SDM since May
> +2020.

I'd just remove this.  Folks don't need to know the SDM history.  I'd
only talk about it here if they would have a hard time finding it
somehow.  Seeing as it's in the main SDM, I can't see how that's a problem.

>  pkeys work by dedicating 4 previously Reserved bits in each page table entry to
>  a "protection key", giving 16 possible keys.
>  
> @@ -22,13 +25,20 @@ and Write Disable) for each of 16 keys.
>  Being a CPU register, PKRU is inherently thread-local, potentially giving each
>  thread a different set of protections from every other thread.
>  
> -There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
> -register.  The feature is only available in 64-bit mode, even though there is
> +For Userspace (PKU), there are two instructions (RDPKRU/WRPKRU) for reading and
> +writing to the register.
> +
> +For Supervisor (PKS), the register (MSR_IA32_PKRS) is accessible only to the
> +kernel through rdmsr and wrmsr.
> +
> +The feature is only available in 64-bit mode, even though there is
>  theoretically space in the PAE PTEs.  These permissions are enforced on data
>  access only and have no effect on instruction fetches.
>  
> -Syscalls
> -========
> +
> +
> +Syscalls for user space keys
> +============================
>  
>  There are 3 system calls which directly interact with pkeys::
>  
> @@ -95,3 +105,42 @@ with a read()::
>  The kernel will send a SIGSEGV in both cases, but si_code will be set
>  to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when
>  the plain mprotect() permissions are violated.
> +
> +
> +Kernel API for PKS support
> +==========================
> +
> +Overview
> +--------
> +
> +Similar to user space pkeys, supervisor pkeys allow additional protections to
> +be defined for a supervisor mappings.  Unlike user space pkeys, violations of
> +these protections result in a kernel oops.
> +
> +Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel's
> +Sapphire Rapids (and later) "Scalable Processor" Server CPUs.  It will also be
> +available in future non-server Intel parts.

This is a little weird.  You've already talked about PKRS and then later
introduce the feature?

Also, perhaps this CPU model bit should just be next to the CPU model
bit about PKU.

> +Also qemu has support as well: https://www.qemu.org/2021/04/30/qemu-6-0-0/
> +
> +Kconfig
> +-------
> +Kernel users intending to use PKS support should depend on
> +ARCH_HAS_SUPERVISOR_PKEYS, and select ARCH_ENABLE_SUPERVISOR_PKEYS to turn on
> +this support within the core.

Maybe this should talk about the Kconfig options a bit more.  Maybe even
an example:

config MY_NEW_FEATURE
	depends on ARCH_HAS_SUPERVISOR_PKEYS
	select ARCH_ENABLE_SUPERVISOR_PKEYS

This will make "MY_NEW_FEATURE" unavailable unless the architecture sets
ARCH_HAS_SUPERVISOR_PKEYS.  It also makes it possible for multiple
independent features to "select ARCH_ENABLE_SUPERVISOR_PKEYS".  PKS
support will not be compiled into the kernel unless one or more features
selects ARCH_ENABLE_SUPERVISOR_PKEYS.

> +MSR details
> +-----------
> +
> +It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing
> +but still maintains ordering properties similar to WRPKRU.

s/It should be noted that the underlying //

I'd probably say:

	WRMSR is typically an architecturally serializing instruction.
	However, WRMSR(MSR_IA32_PKRS) is an exception.  It is not a
	serializing instruction and instead maintains ordering
	properties similar to WRPKRU.

and maybe:

	Check the WRPKRU documentation in the latest version of the SDM
	for details.

> +Older versions of the SDM on PKRS may be wrong with regard to this
> +serialization.  The text should be the same as that of WRPKRU.  From the WRPKRU
> +text:
> +
> +	WRPKRU will never execute transiently. Memory accesses
> +	affected by PKRU register will not execute (even transiently)
> +	until all prior executions of WRPKRU have completed execution
> +	and updated the PKRU register.

I wouldn't go over this.  Software has bugs.  Documentation has bugs.  I
expect folks to use the most recent version.

BTW, there are still a few places in SDM 076 that miss mentioning the
non-serializing properties of PKRS.  I also don't see anything
specifically about the speculative behavior.  There might be fixes on
the way, but can you double-check?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values
  2022-01-27 17:54 ` [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values ira.weiny
@ 2022-01-29  0:02   ` Dave Hansen
  2022-02-04 23:54     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-29  0:02 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> +#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
> +			PKR_AD_KEY(1)	| \
> +			PKR_AD_KEY(2)	| PKR_AD_KEY(3)		| \
> +			PKR_AD_KEY(4)	| PKR_AD_KEY(5)		| \
> +			PKR_AD_KEY(6)	| PKR_AD_KEY(7)		| \
> +			PKR_AD_KEY(8)	| PKR_AD_KEY(9)		| \
> +			PKR_AD_KEY(10)	| PKR_AD_KEY(11)	| \
> +			PKR_AD_KEY(12)	| PKR_AD_KEY(13)	| \
> +			PKR_AD_KEY(14)	| PKR_AD_KEY(15))

Considering how this is going to get used, let's just make this
one-key-per-line:

#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
			PKR_AD_KEY(1)	| \
			PKR_AD_KEY(2)	| \
			PKR_AD_KEY(3)	| \
			...


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-27 17:54 ` [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS ira.weiny
  2022-01-28 22:54   ` Dave Hansen
@ 2022-01-29  0:06   ` Dave Hansen
  2022-02-04 19:14     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-29  0:06 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> @@ -1867,6 +1867,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
>  	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
>  	select ARCH_USES_HIGH_VMA_FLAGS
>  	select ARCH_HAS_PKEYS
> +	select ARCH_HAS_SUPERVISOR_PKEYS

For now, this should be:

	select ARCH_HAS_SUPERVISOR_PKEYS if CPU_SUP_INTEL

unless the AMD folks speak up and say otherwise. :)

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs()
  2022-01-27 17:54 ` [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
@ 2022-01-29  0:12   ` Dave Hansen
  2022-01-29  0:16     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-29  0:12 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> Writing to MSR's is inefficient.  Even though the underlying
> WRMSR(MSR_IA32_PKRS) is not serializing (see below), writing to the MSR
> unnecessarily should be avoided.  This is especially true when the value
> of the PKS protections is unlikely to change from the default often.

This probably needs some context.

The most important pks_write_pkrs() user is in the scheduler, right?

So, this is really about optimizing that scheduler code for the common
case where, even when changing threads, the PKRS value does not change.

Can you explain a bit why you expect that to be the common case?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs()
  2022-01-29  0:12   ` Dave Hansen
@ 2022-01-29  0:16     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-01-29  0:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 04:12:06PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > Writing to MSR's is inefficient.  Even though the underlying
> > WRMSR(MSR_IA32_PKRS) is not serializing (see below), writing to the MSR
> > unnecessarily should be avoided.  This is especially true when the value
> > of the PKS protections is unlikely to change from the default often.
> 
> This probably needs some context.
> 
> The most important pks_write_pkrs() user is in the scheduler, right?

This is also used during exceptions, twice.  Those are probably more important.

> 
> So, this is really about optimizing that scheduler code for the common
> case where, even when changing threads, the PKRS value does not change.
> 
> Can you explain a bit why you expect that to be the common case?

Yes.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch
  2022-01-27 17:54 ` [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
@ 2022-01-29  0:22   ` Dave Hansen
  2022-02-11  6:10     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-01-29  0:22 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The PKS MSR (PKRS) is defined as a per-logical-processor register.  This

s/defined as//

> isolates memory access by logical CPU.  

This second sentence is a bit confusing to me.  I *think* you're trying
to say that PKRS only affects accesses from one logical CPU.  But, it
just comes out strangely.  I think I'd just axe the sentence.

> Unfortunately, the MSR is not
> managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> context switch.
> 
> Define pks_saved_pkrs in struct thread_struct.  Initialize all tasks,
> including the init_task, with the PKS_INIT_VALUE when created.  Restore
> the CPU's MSR to the saved task value on schedule in.
> 
> pks_write_current() is added to ensures non-supervisor pkey

				  ^ ensure

...
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 2c5f12ae7d04..3530a0e50b4f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -2,6 +2,8 @@
>  #ifndef _ASM_X86_PROCESSOR_H
>  #define _ASM_X86_PROCESSOR_H
>  
> +#include <linux/pks-keys.h>
> +
>  #include <asm/processor-flags.h>
>  
>  /* Forward declaration, a strange C thing */
> @@ -502,6 +504,12 @@ struct thread_struct {
>  	unsigned long		cr2;
>  	unsigned long		trap_nr;
>  	unsigned long		error_code;
> +
> +#ifdef	CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* Saved Protection key register for supervisor mappings */
> +	u32			pks_saved_pkrs;
> +#endif

There are a bunch of other "saved" registers in thread_struct.  They all
just have their register name, including pkru.

Can you just stick this next to 'pkru' and call it plain old 'pkrs'?
That will probably even take up less space than this since the two
32-bit values can be packed together.

>  #ifdef CONFIG_VM86
>  	/* Virtual 86 mode info */
>  	struct vm86		*vm86;
> @@ -769,7 +777,14 @@ static inline void spin_lock_prefetch(const void *x)
>  #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
>  
>  #else
> -#define INIT_THREAD { }
> +
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +#define INIT_THREAD  {					\
> +	.pks_saved_pkrs = PKS_INIT_VALUE,		\
> +}
> +#else
> +#define INIT_THREAD  { }
> +#endif
>  
>  extern unsigned long KSTK_ESP(struct task_struct *task);
>  
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 3402edec236c..81fc0b638308 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -59,6 +59,7 @@
>  /* Not included via unistd.h */
>  #include <asm/unistd_32_ia32.h>
>  #endif
> +#include <asm/pks.h>
>  
>  #include "process.h"
>  
> @@ -657,6 +658,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	/* Load the Intel cache allocation PQR MSR. */
>  	resctrl_sched_in();
>  
> +	pks_write_current();
> +
>  	return prev_p;
>  }

At least for pkru and fsgsbase, these have the form:

	x86_<register>_load();

Should this be

	x86_pkrs_load();

and be located next to:

	x86_pkru_load()?

> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 3dce99ef4127..6d94dfc9a219 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -242,6 +242,19 @@ static inline void pks_write_pkrs(u32 new_pkrs)
>  	}
>  }
>  
> +/**
> + * pks_write_current() - Write the current thread's saved PKRS value
> + *
> + * Context: must be called with preemption disabled
> + */
> +void pks_write_current(void)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	pks_write_pkrs(current->thread.pks_saved_pkrs);
> +}
> +
>  /*
>   * PKS is independent of PKU and either or both may be supported on a CPU.
>   *


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code
  2022-01-27 17:54 ` [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code ira.weiny
@ 2022-01-31 19:30   ` Edgecombe, Rick P
  2022-02-09 23:44     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-01-31 19:30 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +static void crash_it(void)
> +{
> +       struct pks_test_ctx *ctx;
> +       void *ptr;
> +
> +       pr_warn("     ***** BEGIN: Unhandled fault test *****\n");
> +
> +       ctx = alloc_ctx(PKS_KEY_TEST);
> +       if (IS_ERR(ctx)) {
> +               pr_err("Failed to allocate context???\n");
> +               return;
> +       }
> +
> +       ptr = alloc_test_page(ctx->pkey);
> +       if (!ptr) {
> +               pr_err("Failed to vmalloc page???\n");
> +               return;
> +       }
> +
> +       /* This purposely faults */
> +       memcpy(ptr, ctx->data, 8);
> +
> +       /* Should never get here if so the test failed */
> +       last_test_pass = false;
> +
> +       vfree(ptr);
> +       free_ctx(ctx);

So these only get cleaned up if the test fails? Could you clean them
up in pks_release_file() like the later test patch?

> +}

snip

> +
> +static void __exit pks_test_exit(void)
> +{
> +       debugfs_remove(pks_test_dentry);
> +       pr_info("test exit\n");
> +}

How does this get called?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 18/44] x86/fault: Add a PKS test fault hook
  2022-01-27 17:54 ` [PATCH V8 18/44] x86/fault: Add a PKS test fault hook ira.weiny
@ 2022-01-31 19:56   ` Edgecombe, Rick P
  2022-02-11 20:40     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-01-31 19:56 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +                * If a protection key exception occurs it could be
> because a PKS test
> +                * is running.  If so, pks_test_callback() will clear
> the protection
> +                * mechanism and return true to indicate the fault
> was handled.
> +                */
> +               if (pks_test_callback())
> +                       return;

Why do we need both this and pks_handle_key_fault()?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite()
  2022-01-27 17:54 ` [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite() ira.weiny
@ 2022-01-31 23:10   ` Edgecombe, Rick P
  2022-02-18  2:22     ` Ira Weiny
  2022-02-01 17:40   ` Dave Hansen
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-01-31 23:10 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +void pks_update_protection(int pkey, u32 protection)
> +{

I don't know if this matters too much, but the type of a pkey is either
int or u16 across this series and PKU. But it's only possibly a 4 bit
value. Seems the smallest that would fit is char. Why use one over the
other?

Also, why u32 for protection here? The whole pkrs value containing the
bits for all keys is 32 bits, but per key there is only room ever for 2
bits, right?

It would be nice to be consistent and have a reason, but again, I don't
know if it makes any real difference.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback
  2022-01-27 17:54 ` [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback ira.weiny
@ 2022-02-01  0:55   ` Edgecombe, Rick P
  2022-03-01 15:39     ` Ira Weiny
  2022-02-01 17:42   ` Edgecombe, Rick P
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01  0:55 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> Add a test which does this.
> 
>         $ echo 5 > /sys/kernel/debug/x86/run_pks
>         $ cat /sys/kernel/debug/x86/run_pks
>         PASS

Hmm, when I run this on qemu TCG, I get:

root@(none):/# echo 5 > /sys/kernel/debug/x86/run_pks
[   29.438159] pks_test: Failed to see the callback
root@(none):/# cat /sys/kernel/debug/x86/run_pks
FAIL

I think it's a problem with the test though. The generated code is not
expecting fault_callback_ctx.callback_seen to get changed in the
exception. The following fixed it for me:

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 1528df0bb283..d979d2afe921 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -570,6 +570,7 @@ static bool run_fault_clear_test(void)
        /* fault */
        memcpy(test_page, ctx->data, 8);
 
+       barrier();
        if (!fault_callback_ctx.callback_seen) {
                pr_err("Failed to see the callback\n");
                rc = false;

But, I wonder if volatile is also needed on the read to be fully
correct. I usually have to consult the docs when I deal with that
stuff...

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode
  2022-01-27 17:55 ` [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode ira.weiny
@ 2022-02-01  1:16   ` Edgecombe, Rick P
  2022-03-02  0:20     ` Ira Weiny
  2022-02-04 19:01   ` Dan Williams
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01  1:16 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:55 -0800, ira.weiny@intel.com wrote:
> +static int param_get_pks_fault_mode(char *buffer, const struct
> kernel_param *kp)
> +{
> +       int ret = 0;

This doesn't need to be initialized.

> +
> +       switch (pks_fault_mode) {
> +       case PKS_MODE_STRICT:
> +               ret = sysfs_emit(buffer, "strict\n");
> +               break;
> +       case PKS_MODE_RELAXED:
> +               ret = sysfs_emit(buffer, "relaxed\n");
> +               break;
> +       default:
> +               ret = sysfs_emit(buffer, "<unknown>\n");
> +               break;
> +       }
> +
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid()
  2022-01-27 17:55 ` [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid() ira.weiny
@ 2022-02-01  1:37   ` Edgecombe, Rick P
  2022-03-02  2:01     ` Ira Weiny
  2022-02-04 19:18   ` Dan Williams
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01  1:37 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:55 -0800, ira.weiny@intel.com wrote:
> +/*
> + * pgmap_protection_flag_invalid - Check and flag an invalid use of
> a pgmap
> + *                                 protected page
> + *
> + * There are code paths which are known to not be compatible with
> pgmap
> + * protections.  

This could get stale over time. Maybe the comment should just
describe what the function does and leave this reasoning to the commit
log?

> pgmap_protection_flag_invalid() is provided as a 'relief
> + * valve' to be used in those functions which are known to be
> incompatible.
> + *
> + * Thus an invalid use case can be flaged with more precise data
> rather than
> + * just flagging a fault.  Like the fault handler code this abandons

In the commit log you called this "the invalid access on fault" and it
seemed a little clearer to me than "just flagging a fault".

> the use of
> + * the PKS key and optionally allows the calling code path to
> continue based on
> + * the configuration of the memremap.pks_fault_mode command line
> + * (and/or sysfs) option.

It lets the calling code continue regardless, right? It just warns if
!PKS_MODE_STRICT. Why not warn in the case of PKS_MODE_STRICT too?

Seems surprising that the stricter setting would have less checks.

> + */
> +static inline void pgmap_protection_flag_invalid(struct page *page)
> +{
> +       if (!pgmap_check_pgmap_prot(page))
> +               return;
> +       __pgmap_protection_flag_invalid(page->pgmap);
> +}

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite()
  2022-01-27 17:54 ` [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite() ira.weiny
  2022-01-31 23:10   ` Edgecombe, Rick P
@ 2022-02-01 17:40   ` Dave Hansen
  2022-02-18  4:39     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-01 17:40 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
> +static inline void pks_mk_readwrite(int pkey)
> +{
> +	pks_update_protection(pkey, PKEY_READ_WRITE);
> +}

I don't really like the "mk" terminology in here.  Maybe it's from
dealing with the PTE helpers, but "mk" to me means that it won't do
anything observable by itself.  We're also not starved for space here,
and it's really odd to abbreviate "make->mk" but not do "readwrite->rw".

This really is going off and changing a register value.  I think:

	pks_set_readwrite()

would be fine.  This starts to get a bit redundant, but looks fine too:

	pks_set_key_readwrite()

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback
  2022-01-27 17:54 ` [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback ira.weiny
  2022-02-01  0:55   ` Edgecombe, Rick P
@ 2022-02-01 17:42   ` Edgecombe, Rick P
  2022-02-11 20:44     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 17:42 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +#define RUN_FAULT_ABANDON      5

The tests still call this operation "abandon" all throughout, but the
operation got renamed in the kernel. Probably should rename it in the
tests too.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-01-27 17:54 ` [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching ira.weiny
@ 2022-02-01 17:43   ` Edgecombe, Rick P
  2022-02-22 21:42     ` Ira Weiny
  2022-02-01 17:47   ` Edgecombe, Rick P
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 17:43 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +int check_context_switch(int cpu)
> +{
> +       int switch_done[2];
> +       int setup_done[2];
> +       cpu_set_t cpuset;
> +       char result[32];
> +       int rc = 0;
> +       pid_t pid;
> +       int fd;
> +
> +       CPU_ZERO(&cpuset);
> +       CPU_SET(cpu, &cpuset);
> +       /*
> +        * Ensure the two processes run on the same CPU so that they
> go through
> +        * a context switch.
> +        */
> +       sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
> +
> +       if (pipe(setup_done)) {
> +               printf("ERROR: Failed to create pipe\n");
> +               return -1;
> +       }
> +       if (pipe(switch_done)) {
> +               printf("ERROR: Failed to create pipe\n");
> +               return -1;
> +       }
> +
> +       pid = fork();
> +       if (pid == 0) {
> +               char done = 'y';
> +
> +               fd = open(PKS_TEST_FILE, O_RDWR);
> +               if (fd < 0) {
> +                       printf("ERROR: cannot open %s\n",
> PKS_TEST_FILE);
> +                       return -1;

When this happens, the error is printed, but the parent process just
hangs forever. Might make it hard to script running all the selftests.

Also, the other x86 selftests mostly use [RUN], [INFO], [OK], [FAIL],
and [SKIP] in their print statements. Probably should stick to the
pattern across all the print statements. This is probably a "[SKIP]".
Just realized I've omitted the "[]" in the CET series too.

> +               }
> +
> +               cpu = sched_getcpu();
> +               printf("Child running on cpu %d...\n", cpu);
> +
> +               /* Allocate and run test. */
> +               write(fd, RUN_SINGLE, 1);
> +
> +               /* Arm for context switch test */
> +               write(fd, ARM_CTX_SWITCH, 1);
> +
> +               printf("   tell parent to go\n");
> +               write(setup_done[1], &done, sizeof(done));
> +
> +               /* Context switch out... */
> +               printf("   Waiting for parent...\n");
> +               read(switch_done[0], &done, sizeof(done));
> +
> +               /* Check msr restored */
> +               printf("Checking result\n");
> +               write(fd, CHECK_CTX_SWITCH, 1);
> +
> +               read(fd, result, 10);
> +               printf("   #PF, context switch, pkey allocation and
> free tests: %s\n", result);
> +               if (!strncmp(result, "PASS", 10)) {
> +                       rc = -1;
> +                       done = 'F';
> +               }
> +
> +               /* Signal result */
> +               write(setup_done[1], &done, sizeof(done));
> +       } else {
> +               char done = 'y';
> +
> +               read(setup_done[0], &done, sizeof(done));
> +               cpu = sched_getcpu();
> +               printf("Parent running on cpu %d\n", cpu);
> +
> +               fd = open(PKS_TEST_FILE, O_RDWR);
> +               if (fd < 0) {
> +                       printf("ERROR: cannot open %s\n",
> PKS_TEST_FILE);
> +                       return -1;
> +               }
> +
> +               /* run test with the same pkey */
> +               write(fd, RUN_SINGLE, 1);
> +
> +               printf("   Signaling child.\n");
> +               write(switch_done[1], &done, sizeof(done));
> +
> +               /* Wait for result */
> +               read(setup_done[0], &done, sizeof(done));
> +               if (done == 'F')
> +                       rc = -1;
> +       }



^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests
  2022-01-27 17:54 ` [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests ira.weiny
@ 2022-02-01 17:45   ` Dave Hansen
  2022-02-18  5:34     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-01 17:45 UTC (permalink / raw)
  To: ira.weiny, Dave Hansen, H. Peter Anvin, Dan Williams
  Cc: Fenghua Yu, Rick Edgecombe, linux-kernel

On 1/27/22 09:54, ira.weiny@intel.com wrote:
>  bool pks_test_callback(void)
>  {
> -	return false;
> +	bool armed = (test_armed_key != 0);
> +
> +	if (armed) {
> +		pks_mk_readwrite(test_armed_key);
> +		fault_cnt++;
> +	}
> +
> +	return armed;
> +}

Where's the locking for all this?  I don't think we need anything fancy,
but is there anything preventing the test from being started from
multiple threads at the same time?  I think a simple global test mutex
would probably suffice.

Also, pks_test_callback() needs at least a comment or two about what
it's doing.

Does this work if you have a test armed and then you get an unrelated
PKS fault on another CPU?  I think this will disarm the test from the
unrelated thread.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-01-27 17:54 ` [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching ira.weiny
  2022-02-01 17:43   ` Edgecombe, Rick P
@ 2022-02-01 17:47   ` Edgecombe, Rick P
  2022-02-01 19:52     ` Edgecombe, Rick P
  2022-02-18  6:02     ` Ira Weiny
  1 sibling, 2 replies; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 17:47 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
>  lib/pks/pks_test.c                     |  74 +++++++++++

Since this only tests a specific operation of pks, should it be named
more specifically? Or it might be handy if it ran all the PKS tests,
even though the others can be run directly.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-01-27 17:54 ` [PATCH V8 26/44] x86/fault: Print PKS MSR on fault ira.weiny
@ 2022-02-01 18:13   ` Edgecombe, Rick P
  2022-02-18  6:01     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 18:13 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> +       if (error_code & X86_PF_PK)
> +               pks_dump_fault_info(regs);
> +

If the kernel makes an errant access to a userspace address with PKU
enabled and the userspace page marked AD, it should oops and get here,
but will the X86_PF_PK bit be set even if SMAP is the real cause? Per
the SDM, it sounds like it would:
"
For accesses to user-mode addresses, the flag is set if
(1) CR4.PKE = 1;
(2) the linear address has protection key i; and
(3) the PKRU register (see Section 4.6.2) is such that either
	(a) ADi = 1; or
	(b) the following all hold: 
		(i) WDi = 1;
		(ii) the access is a write access; and 
		(iii) either CR0.WP = 1 or the access causing the
                      page-fault exception was a user-mode access.
"

...and then this somewhat confusingly dumps the pks register. Is that
the real behavior?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-01-27 17:54 ` [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM ira.weiny
@ 2022-02-01 18:35   ` Edgecombe, Rick P
  2022-02-04 17:12     ` Dan Williams
  0 siblings, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 18:35 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
>  enum pks_pkey_consumers {
> -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> values */
> -       PKS_KEY_TEST            = 1,
> -       PKS_KEY_NR_CONSUMERS    = 2,
> +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> PTE values */
> +       PKS_KEY_TEST                    = 1,
> +       PKS_KEY_PGMAP_PROTECTION        = 2,
> +       PKS_KEY_NR_CONSUMERS            = 3,
>  };

The c spec says that any enum member that doesn't have an "=" will be
one more than the previous member. As a consequence you can leave the
"=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
more like this.

I know we've gone around and around on this, but why also specify the
value for each key? They should auto increment and the first one is
guaranteed to be zero.

Otherwise this doesn't use any of the features of "enum", it's just a
verbose series of const int's.


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-02-01 17:47   ` Edgecombe, Rick P
@ 2022-02-01 19:52     ` Edgecombe, Rick P
  2022-02-18  6:03       ` Ira Weiny
  2022-02-18  6:02     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-01 19:52 UTC (permalink / raw)
  To: hpa, Williams, Dan J, Weiny, Ira, dave.hansen; +Cc: Yu, Fenghua, linux-kernel

On Tue, 2022-02-01 at 09:47 -0800, Edgecombe, Richard P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> >  lib/pks/pks_test.c                     |  74 +++++++++++
> 
> Since this only tests a specific operation of pks, should it be named
> more specifically? Or it might be handy if it ran all the PKS tests,
> even though the others can be run directly.

Oops, I meant "tools/testing/selftests/x86/test_pks.c"

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys
  2022-01-28 22:39   ` Dave Hansen
@ 2022-02-01 23:49     ` Ira Weiny
  2022-02-01 23:54       ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-01 23:49 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 02:39:10PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable
> > +Processor" Server CPUs and later.  And it will be available in future
> > +non-server Intel parts and future AMD processors.
> 
> The non-server parts are quite available these days.  I'm typing on one
> right now:
> 
> 	model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
> 
> This can probably say:
> 
> Protection Keys for Userspace (PKU) can be found on:
>  * Intel server CPUs, Skylake and later
>  * Intel client CPUs, Tiger Lake (11th Gen Core) and later
>  * Future AMD CPUs
> 
> It would be great if the AMD folks can elaborate on that a bit, but I
> understand it might not be possible if the CPUs aren't out yet.

Updated.

But I'm leery about putting in any information about AMD CPUs.  Who could I
ask directly?

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys
  2022-02-01 23:49     ` Ira Weiny
@ 2022-02-01 23:54       ` Dave Hansen
  0 siblings, 0 replies; 145+ messages in thread
From: Dave Hansen @ 2022-02-01 23:54 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 2/1/22 15:49, Ira Weiny wrote:
>> Protection Keys for Userspace (PKU) can be found on:
>>  * Intel server CPUs, Skylake and later
>>  * Intel client CPUs, Tiger Lake (11th Gen Core) and later
>>  * Future AMD CPUs
>>
>> It would be great if the AMD folks can elaborate on that a bit, but I
>> understand it might not be possible if the CPUs aren't out yet.
> Updated.
> 
> But I'm leery about putting in any information about AMD CPU's.  Who could I
> ask directly?

I forgot who from AMD was chiming in on pkeys.  So, I did:

	git log --author=amd.com arch/x86/

and searched for 'keys'.  I came up with this pretty quickly:

> commit 38f3e775e9c242f5430a9c08c11be7577f63a41c
> Author: Babu Moger <babu.moger@amd.com>
> Date:   Thu May 28 11:08:23 2020 -0500
> 
>     x86/Kconfig: Update config and kernel doc for MPK feature on AMD
>     
>     AMD's next generation of EPYC processors support the MPK (Memory
>     Protection Keys) feature. Update the dependency and documentation.
>     
>     Signed-off-by: Babu Moger <babu.moger@amd.com>
>     Signed-off-by: Borislav Petkov <bp@suse.de>
>     Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>     Link: https://lkml.kernel.org/r/159068199556.26992.17733929401377275140.stgit@naples-babu.amd.com
You also don't have to go out and find this information for your
documentation.  Just say "future" and then poke the AMD folks.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h
  2022-01-28 22:43   ` Dave Hansen
@ 2022-02-02  1:00     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-02  1:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 02:43:54PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
> > similar fashions and can share common defines.  Specifically PKS and PKU
> > each have:
> > 
> > 	1. A single control register
> > 	2. The same number of keys
> > 	3. The same number of bits in the register per key
> > 	4. Access and Write disable in the same bit locations
> > 
> > Given the above, share all the macros that synthesize and manipulate
> > register values between the two features.  Share these defines by moving
> > them into a new header, change their names to reflect the common use,
> > and include the header where needed.
> 
> I'd probably include *one* more sentence to prime the reader for the
> pattern they are about to see.  Perhaps:
> 
> 	This mostly takes the form of converting names from the PKU-
> 	specific "PKRU" to the U/S-agnostic "PKR".

Fair enough.

> 
> > Also while editing the code remove the use of 'we' from comments being
> > touched.
> > 
> > NOTE the checkpatch errors are ignored for the init_pkru_value to
> > align the values in the code.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Either way, this looks fine:
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks!
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros
  2022-01-28 22:47   ` Dave Hansen
@ 2022-02-02 20:21     ` Ira Weiny
  2022-02-02 20:26       ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-02 20:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 02:47:30PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > +#define PKR_AD_KEY(pkey)	(PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
> > +#define PKR_WD_KEY(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
> 
> I don't _hate_ this, but naming here is wonky for me.  PKR_WD_KEY reads
> to me as "pkey register write-disable key", as in, please write-disable
> this key, or maybe "make a write-disable key".

Ok...  that is reasonable...

> 
> It's generating a mask, so I'd probably name it:
> 
> #define PKR_WD_MASK(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
> 
> Which says, "generate a write-disabled mask for this pkey".

I think the confusion comes from me having used these as mask values rather
than what PKR_AD_KEY() was intended to be used for.

In the previous patch PKR_AD_KEY() is used to set up the default user pkey
value...

u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);

I'll have to think about it but I don't think I like the following...

u32 init_pkru_value = PKR_AD_MASK( 1) | PKR_AD_MASK( 2) | PKR_AD_MASK( 3) |
		      PKR_AD_MASK( 4) | PKR_AD_MASK( 5) | PKR_AD_MASK( 6) |
		      PKR_AD_MASK( 7) | PKR_AD_MASK( 8) | PKR_AD_MASK( 9) |
		      PKR_AD_MASK(10) | PKR_AD_MASK(11) | PKR_AD_MASK(12) |
		      PKR_AD_MASK(13) | PKR_AD_MASK(14) | PKR_AD_MASK(15);

It seems odd to me.  Does it seem odd to you?

Looking at the final code I think I'm going to just drop the usages in this
patch and add PKR_WD_KEY() where it is used first.

Also, how about PKR_KEY_INIT_{AD|WD|RW}() as a name?

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access()
  2022-01-28 22:50   ` Dave Hansen
@ 2022-02-02 20:22     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-02 20:22 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 02:50:44PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Both PKU and PKS update their register values in the same way.  They can
> > therefore share the update code.
> > 
> > Define a helper, pkey_update_pkval(), which will be used to support both
> > Protection Key User (PKU) and the new Protection Key for Supervisor
> > (PKS) in subsequent patches.
> > 
> > pkey_update_pkval() contributed by Thomas
> > 
> > Co-developed-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Looks better than my original code.  Waaaaaay simpler.
> 
> Acked-by: Dave Hansen <dave.hansen@linux.intel.com>

Thanks,
Ira


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros
  2022-02-02 20:21     ` Ira Weiny
@ 2022-02-02 20:26       ` Dave Hansen
  2022-02-02 20:28         ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-02 20:26 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 2/2/22 12:21, Ira Weiny wrote:
> On Fri, Jan 28, 2022 at 02:47:30PM -0800, Dave Hansen wrote:
>> #define PKR_WD_MASK(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
>>
>> Which says, "generate a write-disabled mask for this pkey".
> 
> I think the confusion comes from me having used these as mask values rather
> than what PKR_AD_KEY() was intended to be used for.
> 
> In the previous patch PKR_AD_KEY() is used to set up the default user pkey
> value...
> 
> u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
> 		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
> 		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
> 		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
> 		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
> 

Hah, I'm complaining about my own code.

> u32 init_pkru_value = PKR_AD_MASK( 1) | PKR_AD_MASK( 2) | PKR_AD_MASK( 3) |
> 		      PKR_AD_MASK( 4) | PKR_AD_MASK( 5) | PKR_AD_MASK( 6) |
> 		      PKR_AD_MASK( 7) | PKR_AD_MASK( 8) | PKR_AD_MASK( 9) |
> 		      PKR_AD_MASK(10) | PKR_AD_MASK(11) | PKR_AD_MASK(12) |
> 		      PKR_AD_MASK(13) | PKR_AD_MASK(14) | PKR_AD_MASK(15);
> 
> It seems odd to me.  Does it seem odd to you?

Looks OK to me.  It builds a "value" out of a bunch of individual masks.

> Looking at the final code I think I'm going to just drop the usages in this
> patch and add PKR_WD_KEY() where it is used first.
> 
> Also, how about PKR_KEY_INIT_{AD|WD|RW}() as a name?

For the PKR_AD_KEY() macro?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros
  2022-02-02 20:26       ` Dave Hansen
@ 2022-02-02 20:28         ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-02 20:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Wed, Feb 02, 2022 at 12:26:44PM -0800, Dave Hansen wrote:
> On 2/2/22 12:21, Ira Weiny wrote:
> > On Fri, Jan 28, 2022 at 02:47:30PM -0800, Dave Hansen wrote:
> >> #define PKR_WD_MASK(pkey)	(PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
> >>
> >> Which says, "generate a write-disabled mask for this pkey".
> > 
> > I think the confusion comes from me having used these as mask values rather
> > than what PKR_AD_KEY() was intended to be used for.
> > 
> > In the previous patch PKR_AD_KEY() is used to set up the default user pkey
> > value...
> > 
> > u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
> > 		      PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
> > 		      PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
> > 		      PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
> > 		      PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);
> > 
> 
> Hah, I'm complaining about my own code.
> 
> > u32 init_pkru_value = PKR_AD_MASK( 1) | PKR_AD_MASK( 2) | PKR_AD_MASK( 3) |
> > 		      PKR_AD_MASK( 4) | PKR_AD_MASK( 5) | PKR_AD_MASK( 6) |
> > 		      PKR_AD_MASK( 7) | PKR_AD_MASK( 8) | PKR_AD_MASK( 9) |
> > 		      PKR_AD_MASK(10) | PKR_AD_MASK(11) | PKR_AD_MASK(12) |
> > 		      PKR_AD_MASK(13) | PKR_AD_MASK(14) | PKR_AD_MASK(15);
> > 
> > It seems odd to me.  Does it seem odd to you?
> 
> Looks OK to me.  It builds a "value" out of a bunch of individual masks.
> 
> > Looking at the final code I think I'm going to just drop the usages in this
> > patch and add PKR_WD_KEY() where it is used first.
> > 
> > Also, how about PKR_KEY_INIT_{AD|WD|RW}() as a name?
> 
> For the PKR_AD_KEY() macro?

Yes, if I drop this patch then the only place these are used is to initialize
the registers.

Ira


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 42/44] dax: Stray access protection for dax_direct_access()
  2022-01-27 17:55 ` [PATCH V8 42/44] dax: Stray access protection for dax_direct_access() ira.weiny
@ 2022-02-04  5:19   ` Dan Williams
  2022-03-01 18:13     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04  5:19 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> dax_direct_access() provides a way to obtain the direct map address of
> PMEM memory.  Coordinate PKS protection with dax_direct_access() of
> protected devmap pages.
>
> Introduce 3 new dax_operation calls .map_protected .mk_readwrite and
> .mk_noaccess. These 3 calls do not have to be implemented by the dax
> provider if no protection is implemented.
>
> Threads of execution can use dax_mk_{readwrite,noaccess}() to relax the
> protection of the dax device and allow direct use of the kaddr returned
> from dax_direct_access().  The dax_mk_{readwrite,noaccess}() calls only
> need to be used to guard actual access to the memory.  Other uses of
> dax_direct_access() do not need to use these guards.
>
> For users who require a permanent address to the dax device, such as the
> DM write cache, dax_map_protected() indicates that the dax device has
> additional protections and that the user should create its own permanent
> mapping of the memory.  Update the DM write cache code to create this
> permanent mapping.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
[..]
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9fc5f99a0ae2..261af298f89f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -30,6 +30,10 @@ struct dax_operations {
>                         sector_t, sector_t);
>         /* zero_page_range: required operation. Zero page range   */
>         int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
> +
> +       bool (*map_protected)(struct dax_device *dax_dev);
> +       void (*mk_readwrite)(struct dax_device *dax_dev);
> +       void (*mk_noaccess)(struct dax_device *dax_dev);

So the dax code just recently jettisoned the ->copy_{to,from}_iter()
ops and it would be shame to grow more ops. Given that the
implementation is pgmap generic I think all that is needed is way to
go from a daxdev to a pgmap and then use the pgmap helpers directly
rather than indirecting through the pmem driver just to get the pgmap.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
  2022-01-27 17:54 ` [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
@ 2022-02-04 15:49   ` Dan Williams
  0 siblings, 0 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-04 15:49 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> The persistent memory (PMEM) driver uses the memremap_pages facility to
> provide 'struct page' metadata (vmemmap) for PMEM.  Given that PMEM
> capacity maybe orders of magnitude higher capacity than System RAM it

s/maybe/may be/

> presents a large vulnerability surface to stray writes.  Unlike stray
> writes to System RAM, which may result in a crash or other undesirable
> behavior, stray writes to PMEM additionally are more likely to result in
> permanent data loss. Reboot is not a remediation for PMEM corruption
> like it is for System RAM.
>
> Given that PMEM access from the kernel is limited to a constrained set
> of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
> page), it is amenable to supervisor pkey protection.
>
> Not all systems with PMEM will want additional protections.  Therefore,
> add a Kconfig option for the user to configure the additional devmap
> protections.
>
> Only systems with supervisor protection keys (PKS) are able to support
> this new protection so depend on ARCH_HAS_SUPERVISOR_PKEYS.
> Furthermore, select ARCH_ENABLE_SUPERVISOR_PKEYS to ensure that the
> architecture support is enabled if PMEM is the only use case.
>
> Only PMEM which is advertised to the memory subsystem needs this
> protection.  Therefore, the feature depends on NVDIMM_PFN.
>
> A default of (NVDIMM_PFN && ARCH_HAS_SUPERVISOR_PKEYS) was suggested but
> logically that is the same as saying default 'yes' because both
> NVDIMM_PFN and ARCH_HAS_SUPERVISOR_PKEYS are required.  Therefore a
> default of 'yes' is used.

It still feels odd to default this to y just because the ARCH enables
it. I think it's fine for this to require explicit opt-in especially
because it has non-zero overhead and there are other PKEYS users on
the horizon.

>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Split this out from
>                 [PATCH V7 13/18] memremap_pages: Add access protection via supervisor Protection Keys (PKS)
> ---
>  mm/Kconfig | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 46f2bb15aa4e..67e0264acf7d 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -776,6 +776,24 @@ config ZONE_DEVICE
>
>           If FS_DAX is enabled, then say Y.
>
> +config DEVMAP_ACCESS_PROTECTION
> +       bool "Access protection for memremap_pages()"
> +       depends on NVDIMM_PFN
> +       depends on ARCH_HAS_SUPERVISOR_PKEYS
> +       select ARCH_ENABLE_SUPERVISOR_PKEYS
> +       default y
> +
> +       help
> +         Enable extra protections on device memory.  This protects against
> +         unintended access to devices, such as stray writes.  This feature is
> +         particularly useful to protect against corruption of persistent
> +         memory.
> +
> +         This depends on architecture support of supervisor PKeys and has no
> +         overhead if the architecture does not support them.
> +
> +         If you have persistent memory say 'Y'.
> +
>  config DEV_PAGEMAP_OPS
>         bool
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available()
  2022-01-27 17:54 ` [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available() ira.weiny
@ 2022-02-04 16:19   ` Dan Williams
  2022-02-28 16:59     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 16:19 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Users will need to specify that they want their dev_pagemap pages
> protected by specifying a flag in (struct dev_pagemap)->flags.  However,
> it is more efficient to know if that protection is available prior to
> requesting it and failing the mapping.
>
> Define pgmap_protection_available() for users to check if protection is
> available to be used.  The name of pgmap_protection_available() was
> specifically chosen to isolate the implementation of the protection from
> higher level users.  However, the current implementation simply calls
> pks_available() to determine if it can support protection.
>
> It was considered to have users specify the flag and check if the
> dev_pagemap object returned was protected or not.  But this was
> considered less efficient than a direct check beforehand.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Split this out to it's own patch.
>         s/pgmap_protection_enabled/pgmap_protection_available
> ---
>  include/linux/mm.h | 13 +++++++++++++
>  mm/memremap.c      | 11 +++++++++++
>  2 files changed, 24 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e1a84b1e6787..2ae99bee6e82 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1143,6 +1143,19 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>                 page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> +
> +bool pgmap_protection_available(void);
> +
> +#else
> +
> +static inline bool pgmap_protection_available(void)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define folio_ref_zero_or_close_to_overflow(folio) \
>         ((unsigned int) folio_ref_count(folio) + 127u <= 127u)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 6aa5f0c2d11f..c13b3b8a0048 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -6,6 +6,7 @@
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm.h>
>  #include <linux/pfn_t.h>
> +#include <linux/pkeys.h>
>  #include <linux/swap.h>
>  #include <linux/mmzone.h>
>  #include <linux/swapops.h>
> @@ -63,6 +64,16 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
>  }
>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
>
> +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> +
> +bool pgmap_protection_available(void)
> +{
> +       return pks_available();
> +}
> +EXPORT_SYMBOL_GPL(pgmap_protection_available);

Any reason this was chosen to be an out-of-line function? Doesn't this
defeat the performance advantages of static_cpu_has()?

> +
> +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
> +
>  static void pgmap_array_delete(struct range *range)
>  {
>         xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-01 18:35   ` Edgecombe, Rick P
@ 2022-02-04 17:12     ` Dan Williams
  2022-02-05  5:40       ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 17:12 UTC (permalink / raw)
  To: Edgecombe, Rick P; +Cc: hpa, Weiny, Ira, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 1, 2022 at 10:35 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> >  enum pks_pkey_consumers {
> > -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> > values */
> > -       PKS_KEY_TEST            = 1,
> > -       PKS_KEY_NR_CONSUMERS    = 2,
> > +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> > PTE values */
> > +       PKS_KEY_TEST                    = 1,
> > +       PKS_KEY_PGMAP_PROTECTION        = 2,
> > +       PKS_KEY_NR_CONSUMERS            = 3,
> >  };
>
> The c spec says that any enum member that doesn't have an "=" will be
> one more than the previous member. As a consequence you can leave the
> "=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
> more like this.
>
> I know we've gone around and around on this, but why also specify the
> value for each key? They should auto increment and the first one is
> guaranteed to be zero.
>
> Otherwise this doesn't use any of the features of "enum", it's just a
> verbose series of const int's.

Going further, this can also build in support for dynamically (at
build time) freeing keys based on config, something like:

enum {
#if IS_ENABLED(CONFIG_PKS_TEST)
PKS_KEY_TEST,
#endif
#if IS_ENABLED(CONFIG_DEVMAP_PROTECTION)
PKS_KEY_PGMAP_PROTECTION,
#endif
PKS_KEY_NR_CONSUMERS,
}


* Re: [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested
  2022-01-27 17:54 ` [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested ira.weiny
@ 2022-02-04 17:41   ` Dan Williams
  2022-03-01 18:15     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 17:41 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> When the user requests protections the dev_pagemap mappings need to have
> a PKEY set.
>
> Define devmap_protection_adjust_pgprot() to add the PKey to the page
> protections.  Call it when PGMAP_PROTECTIONS is requested when remapping
> pages.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---

Does this patch have a reason to exist independent of the patch that
introduced devmap_protection_enable()?

Otherwise looks ok.


* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-01-27 17:54 ` [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls ira.weiny
@ 2022-02-04 18:35   ` Dan Williams
  2022-02-05  0:09     ` Ira Weiny
  2022-02-22 22:05     ` Ira Weiny
  0 siblings, 2 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-04 18:35 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Users will need a way to flag valid access to pages which have been
> protected with PGMAP protections.  Provide this by defining pgmap_mk_*()
> accessor functions.

I find the ambiguous use of "Users" not helpful to set the context. How about:

A thread that wants to access memory protected by PGMAP protections
must first enable access, and then disable access when it is done.

>
> pgmap_mk_{readwrite|noaccess}() take a struct page for convenience.
> They determine if the page is protected by dev_pagemap protections.  If
> so, they perform the requested operation.
>
> In addition, the lower level __pgmap_* functions are exported.  They
> take the dev_pagemap object directly for internal users who have
> knowledge of the dev_pagemap.
>
> All changes in the protections must be through the above calls.  They
> abstract the protection implementation (currently the PKS api) from the
> upper layer users.
>
> Furthermore, the calls are nestable by the use of a per task reference
> count.  This ensures that the first call to re-enable protection does
> not 'break' the last access of the device memory.
>
> Access to device memory during exceptions (#PF) is expected only from
> user faults.  Therefore there is no need to maintain the reference count
> when entering or exiting exceptions.  However, reference counting will
> occur during the exception.  Recall that protection is automatically
> enabled during exceptions by the PKS core.[1]
>
> NOTE: It is not anticipated that any code paths will directly nest these
> calls.  For this reason multiple reviewers, including Dan and Thomas,
> asked why this reference counting was needed at this level rather than
> in a higher level call such as kmap_{atomic,local_page}().  The reason
> is that pgmap_mk_readwrite() could nest with regards to other callers of
> pgmap_mk_*() such as kmap_{atomic,local_page}().  Therefore push this
> reference counting to the lower level and just ensure that these calls
> are nestable.

I still don't think that explains why task struct has a role to play
here, see below.

Another missing bit of clarification, maybe I missed it, is why are
the protections toggled between read-write and noaccess. For
stray-write protection toggling between read-write and read-only is
sufficient. I can imagine speculative execution and debug rationales
for noaccess, but those should be called out explicitly.

>
> [1] https://lore.kernel.org/lkml/20210401225833.566238-9-ira.weiny@intel.com/
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Split these functions into their own patch.
>                 This helps to clarify the commit message and usage.
> ---
>  include/linux/mm.h    | 34 ++++++++++++++++++++++++++++++++++
>  include/linux/sched.h |  7 +++++++
>  init/init_task.c      |  3 +++
>  mm/memremap.c         | 14 ++++++++++++++
>  4 files changed, 58 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6e4a2758e3d3..60044de77c54 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,10 +1162,44 @@ static inline bool devmap_protected(struct page *page)
>         return false;
>  }
>
> +void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
> +void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
> +
> +static inline bool pgmap_check_pgmap_prot(struct page *page)
> +{
> +       if (!devmap_protected(page))
> +               return false;
> +
> +       /*
> +        * There is no known use case to change permissions in an irq for pgmap
> +        * pages
> +        */
> +       lockdep_assert_in_irq();
> +       return true;
> +}
> +
> +static inline void pgmap_mk_readwrite(struct page *page)
> +{
> +       if (!pgmap_check_pgmap_prot(page))
> +               return;
> +       __pgmap_mk_readwrite(page->pgmap);
> +}
> +static inline void pgmap_mk_noaccess(struct page *page)
> +{
> +       if (!pgmap_check_pgmap_prot(page))
> +               return;
> +       __pgmap_mk_noaccess(page->pgmap);
> +}
> +
>  bool pgmap_protection_available(void);
>
>  #else
>
> +static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
> +static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
> +static inline void pgmap_mk_readwrite(struct page *page) { }
> +static inline void pgmap_mk_noaccess(struct page *page) { }
> +
>  static inline bool pgmap_protection_available(void)
>  {
>         return false;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f5b2be39a78c..5020ed7e67b7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1492,6 +1492,13 @@ struct task_struct {
>         struct callback_head            l1d_flush_kill;
>  #endif
>
> +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> +       /*
> +        * NOTE: pgmap_prot_count is modified within a single thread of
> +        * execution.  So it does not need to be atomic_t.
> +        */
> +       u32                             pgmap_prot_count;
> +#endif

It's not at all clear why the task struct needs to be burdened with
this accounting. Given that a devmap instance is needed to manage page
protections, why not move the nested protection tracking to a percpu
variable relative to an @pgmap arg? Something like:

void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
{
       migrate_disable();
       preempt_disable();
       if (this_cpu_add_return(pgmap->pgmap_prot_count, 1) == 1)
               pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
}
EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);

void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
{
       if (!this_cpu_sub_return(pgmap->pgmap_prot_count, 1))
               pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
       preempt_enable();
       migrate_enable();
}
EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);

The naming, which I had a hand in, is not aging well. When I see "mk"
I expect it to be building some value like a page table entry that
will be installed later. These helpers are directly enabling and
disabling access and are meant to be called symmetrically. So I would
expect symmetric names like:

pgmap_enable_access()
pgmap_disable_access()


>         /*
>          * New fields for task_struct should be added above here, so that
>          * they are included in the randomized portion of task_struct.
> diff --git a/init/init_task.c b/init/init_task.c
> index 73cc8f03511a..948b32cf8139 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -209,6 +209,9 @@ struct task_struct init_task
>  #ifdef CONFIG_SECCOMP_FILTER
>         .seccomp        = { .filter_count = ATOMIC_INIT(0) },
>  #endif
> +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> +       .pgmap_prot_count = 0,
> +#endif
>  };
>  EXPORT_SYMBOL(init_task);
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d3e6f328a711..b75c4f778c59 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -96,6 +96,20 @@ static void devmap_protection_disable(void)
>         static_branch_dec(&dev_pgmap_protection_static_key);
>  }
>
> +void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> +{
> +       if (!current->pgmap_prot_count++)
> +               pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
> +}
> +EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);
> +
> +void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
> +{
> +       if (!--current->pgmap_prot_count)
> +               pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
> +}
> +EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);
> +
>  bool pgmap_protection_available(void)
>  {
>         return pks_available();
> --
> 2.31.1
>


* Re: [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode
  2022-01-27 17:55 ` [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode ira.weiny
  2022-02-01  1:16   ` Edgecombe, Rick P
@ 2022-02-04 19:01   ` Dan Williams
  2022-03-02  2:00     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 19:01 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Some systems may be using pmem in unanticipated ways.  As such, it is
> possible an foreseen code path to violate the restrictions of the PMEM
> PKS protections.

These sentences do not parse for me. How about:

"When PKS protections for PMEM are enabled the kernel may capture
stray writes, or it may capture false positive access violations. An
example of a false positive access violation is a code path that
neglects to call kmap_{atomic,local_page}, but is otherwise a valid
access. In the false positive scenario there is no actual risk to data
integrity, but the kernel still needs to make a decision as to whether
to report the access violation and continue, or treat the violation as
fatal. That policy decision is captured in a new pks_fault_mode kernel
parameter."

>
> In order to provide a more seamless integration of the PMEM PKS feature

Not sure what "seamless integration" means in this context?

> provide a pks_fault_mode that allows for a relaxed mode should a
> previously working feature fault on the PKS protected PMEM.
>
> 2 modes are available:
>
>         'relaxed' (default) -- WARN_ONCE, remove the protections, and
>         continue to operate.
>
>         'strict' -- BUG_ON/or fault indicating the error.  This is the
>         most protective of the PMEM memory but may be undesirable in
>         some configurations.
>
> NOTE: The typedef of pks_fault_modes is required to allow
> param_check_pks_fault() to work automatically for us.  So the typedef
> checkpatch warning is ignored.

This doesn't parse for me, why is a typedef needed for a simple
toggle? Who is "us"?

>
> NOTE: There was some debate about if a 3rd mode called 'silent' should
> be available.  'silent' would be the same as 'relaxed' but not print any
> output.  While 'silent' is nice for admins to reduce console/log output
> it would result in less motivation to fix invalid access to the
> protected pmem pages.  Therefore, 'silent' is left out.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Use pks_update_exception() instead of abandoning the pkey.
>         Split out pgmap_protection_flag_invalid() into a separate patch
>                 for clarity.
>         From Rick Edgecombe
>                 Fix sysfs_streq() checks
>         From Randy Dunlap
>                 Fix Documentation closing parans
>
> Changes for V7
>         Leverage Rick Edgecombe's fault callback infrastructure to relax invalid
>                 uses and prevent crashes
>         From Dan Williams
>                 Use sysfs_* calls for parameter
>                 Make pgmap_disable_protection inline
>                 Remove pfn from warn output
>         Remove silent parameter option
> ---
>  .../admin-guide/kernel-parameters.txt         | 14 ++++
>  arch/x86/mm/pkeys.c                           |  4 ++
>  include/linux/mm.h                            |  3 +
>  mm/memremap.c                                 | 67 +++++++++++++++++++
>  4 files changed, 88 insertions(+)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f5a27f067db9..3e70a6194831 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4158,6 +4158,20 @@
>         pirq=           [SMP,APIC] Manual mp-table setup
>                         See Documentation/x86/i386/IO-APIC.rst.
>
> +       memremap.pks_fault_mode=        [X86] Control the behavior of page map
> +                       protection violations.  Violations may not be an actual
> +                       use of the memory but simply an attempt to map it in an
> +                       incompatible way.
> +                       (depends on CONFIG_DEVMAP_ACCESS_PROTECTION)
> +
> +                       Format: { relaxed | strict }
> +
> +                       relaxed - Print a warning, disable the protection and
> +                                 continue execution.
> +                       strict - Stop kernel execution via BUG_ON or fault
> +
> +                       default: relaxed
> +
>         plip=           [PPT,NET] Parallel port network link
>                         Format: { parport<nr> | timid | 0 }
>                         See also Documentation/admin-guide/parport.rst.
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index fa71037c1dd0..e864a9b7828a 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -6,6 +6,7 @@
>  #include <linux/debugfs.h>             /* debugfs_create_u32()         */
>  #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
>  #include <linux/pkeys.h>                /* PKEY_*                       */
> +#include <linux/mm.h>                   /* fault callback               */
>  #include <uapi/asm-generic/mman-common.h>
>
>  #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
> @@ -243,6 +244,9 @@ static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
>  #ifdef CONFIG_PKS_TEST
>         [PKS_KEY_TEST]          = pks_test_fault_callback,
>  #endif
> +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> +       [PKS_KEY_PGMAP_PROTECTION]   = pgmap_pks_fault_callback,
> +#endif
>  };
>
>  static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60044de77c54..e900df563437 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1193,6 +1193,9 @@ static inline void pgmap_mk_noaccess(struct page *page)
>
>  bool pgmap_protection_available(void);
>
> +bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
> +                             bool write);
> +
>  #else
>
>  static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b75c4f778c59..783b1cd4bb42 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -96,6 +96,73 @@ static void devmap_protection_disable(void)
>         static_branch_dec(&dev_pgmap_protection_static_key);
>  }
>
> +/*
> + * Ignore the checkpatch warning because the typedef allows

Why document forever in perpetuity to ignore a checkpatch warning for
something that is no longer a patch once it is upstream?

> + * param_check_pks_fault_modes to automatically check the passed value.
> + */
> +typedef enum {
> +       PKS_MODE_STRICT  = 0,
> +       PKS_MODE_RELAXED = 1,
> +} pks_fault_modes;
> +
> +pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;
> +
> +static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp)
> +{
> +       int ret = -EINVAL;
> +
> +       if (sysfs_streq(val, "relaxed")) {
> +               pks_fault_mode = PKS_MODE_RELAXED;
> +               ret = 0;
> +       } else if (sysfs_streq(val, "strict")) {
> +               pks_fault_mode = PKS_MODE_STRICT;
> +               ret = 0;
> +       }
> +
> +       return ret;
> +}
> +
> +static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp)
> +{
> +       int ret = 0;
> +
> +       switch (pks_fault_mode) {
> +       case PKS_MODE_STRICT:
> +               ret = sysfs_emit(buffer, "strict\n");
> +               break;
> +       case PKS_MODE_RELAXED:
> +               ret = sysfs_emit(buffer, "relaxed\n");
> +               break;
> +       default:
> +               ret = sysfs_emit(buffer, "<unknown>\n");
> +               break;
> +       }
> +
> +       return ret;
> +}
> +
> +static const struct kernel_param_ops param_ops_pks_fault_modes = {
> +       .set = param_set_pks_fault_mode,
> +       .get = param_get_pks_fault_mode,
> +};
> +
> +#define param_check_pks_fault_modes(name, p) \
> +       __param_check(name, p, pks_fault_modes)
> +module_param(pks_fault_mode, pks_fault_modes, 0644);

Is the complexity to change this at runtime necessary? It seems
sufficient to make this read-only via sysfs and only rely on command
line toggles to override the default policy.

> +
> +bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
> +                             bool write)
> +{
> +       /* In strict mode just let the fault handler oops */
> +       if (pks_fault_mode == PKS_MODE_STRICT)
> +               return false;
> +
> +       WARN_ONCE(1, "Page map protection being disabled");
> +       pks_update_exception(regs, PKS_KEY_PGMAP_PROTECTION, 0);
> +       return true;
> +}
> +EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback);
> +
>  void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
>  {
>         if (!current->pgmap_prot_count++)
> --
> 2.31.1
>


* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-28 23:51       ` Dave Hansen
@ 2022-02-04 19:08         ` Ira Weiny
  2022-02-09  5:34           ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-04 19:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 03:51:56PM -0800, Dave Hansen wrote:
> On 1/28/22 15:10, Ira Weiny wrote:
> > The issue is that because PKS users are in kernel only and are not part of the
> > architecture specific code there need to be two mechanisms within the Kconfig
> > structure.  One to communicate an architecture's support of PKS such that the user
> > who needs it can depend on that config as well as a second to allow that user
> > to communicate back to the architecture to enable PKS.
> 
> I *think* the point here is to ensure that PKS isn't compiled in unless
> it is supported *AND* needed.

Yes.

> You have to have architecture support
> (ARCH_HAS_SUPERVISOR_PKEYS) to permit features that depend on PKS to be
> enabled.  Then, once one ore more of *THOSE* is enabled,
> ARCH_ENABLE_SUPERVISOR_PKEYS comes into play and actually compiles the
> feature in.
> 
> In other words, there are two things that must happen before the code
> gets compiled in:
> 
> 1. Arch support
> 2. One or more features to use the arch support

Yes.  I really think we are both saying the same thing with different words.

Ira


* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-01-29  0:06   ` Dave Hansen
@ 2022-02-04 19:14     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-04 19:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 04:06:29PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > @@ -1867,6 +1867,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
> >  	depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
> >  	select ARCH_USES_HIGH_VMA_FLAGS
> >  	select ARCH_HAS_PKEYS
> > +	select ARCH_HAS_SUPERVISOR_PKEYS
> 
> For now, this should be:
> 
> 	select ARCH_HAS_SUPERVISOR_PKEYS if CPU_SUP_INTEL
> 
> unless the AMD folks speak up and say otherwise. :)

Done for now.

Ira


* Re: [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid()
  2022-01-27 17:55 ` [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid() ira.weiny
  2022-02-01  1:37   ` Edgecombe, Rick P
@ 2022-02-04 19:18   ` Dan Williams
  1 sibling, 0 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-04 19:18 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Some systems may be using pmem in ways that are known to be incompatible

"systems"? You mean kernel code paths? Is it really plural "ways" and
not just the kmap() problem?

> with the PKS implementation.  One such example is the use of kmap() to
> create 'global' mappings.

What are the other examples? I.e. besides bugs, what are the
legitimate ways for the kernel to access memory that are now invalid
in the presence of kmap() access protections. They should all be
listed here. Like the CONFIG_64BIT direct page_address() use problem.
Are there others?

> Rather than only reporting the invalid access on fault, provide a call
> to flag those uses immediately.  This allows for a much better splat for
> debugging to occur.

It does? The faulting RIP will be in the splat, that's not good enough?

It just seems like a lost cause to try to get all the potential paths
that get access protection wrong to spend the time instrumenting this
self-incriminating call. Just let the relaxed mode WARNs "name and
shame" those code paths.

>
> This is also nice because even if no invalid access' actually occurs,
> the invalid mapping can be fixed with kmap_local_page() rather than
> having to look for a different solution.
>
> Define pgmap_protection_flag_invalid() and have it follow the policy set
> by pks_fault_mode.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Split this from the fault mode patch
> ---
>  include/linux/mm.h | 23 +++++++++++++++++++++++
>  mm/memremap.c      |  9 +++++++++
>  2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e900df563437..3c0aa686b5bd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,7 @@ static inline bool devmap_protected(struct page *page)
>         return false;
>  }
>
> +void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap);
>  void __pgmap_mk_readwrite(struct dev_pagemap *pgmap);
>  void __pgmap_mk_noaccess(struct dev_pagemap *pgmap);
>
> @@ -1178,6 +1179,27 @@ static inline bool pgmap_check_pgmap_prot(struct page *page)
>         return true;
>  }
>
> +/*
> + * pgmap_protection_flag_invalid - Check and flag an invalid use of a pgmap
> + *                                 protected page
> + *
> + * There are code paths which are known to not be compatible with pgmap
> + * protections.  pgmap_protection_flag_invalid() is provided as a 'relief
> + * valve' to be used in those functions which are known to be incompatible.
> + *
> + * Thus an invalid use case can be flagged with more precise data rather than
> + * just flagging a fault.  Like the fault handler code this abandons the use of
> + * the PKS key and optionally allows the calling code path to continue based on
> + * the configuration of the memremap.pks_fault_mode command line
> + * (and/or sysfs) option.
> + */
> +static inline void pgmap_protection_flag_invalid(struct page *page)
> +{
> +       if (!pgmap_check_pgmap_prot(page))
> +               return;
> +       __pgmap_protection_flag_invalid(page->pgmap);
> +}
> +
>  static inline void pgmap_mk_readwrite(struct page *page)
>  {
>         if (!pgmap_check_pgmap_prot(page))
> @@ -1200,6 +1222,7 @@ bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
>
>  static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
>  static inline void __pgmap_mk_noaccess(struct dev_pagemap *pgmap) { }
> +static inline void pgmap_protection_flag_invalid(struct page *page) { }
>  static inline void pgmap_mk_readwrite(struct page *page) { }
>  static inline void pgmap_mk_noaccess(struct page *page) { }
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 783b1cd4bb42..fd4b9b83b770 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -150,6 +150,15 @@ static const struct kernel_param_ops param_ops_pks_fault_modes = {
>         __param_check(name, p, pks_fault_modes)
>  module_param(pks_fault_mode, pks_fault_modes, 0644);
>
> +void __pgmap_protection_flag_invalid(struct dev_pagemap *pgmap)
> +{
> +       if (pks_fault_mode == PKS_MODE_STRICT)
> +               return;
> +
> +       WARN_ONCE(1, "Invalid page map use");
> +}
> +EXPORT_SYMBOL_GPL(__pgmap_protection_flag_invalid);
> +
>  bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
>                               bool write)
>  {
> --
> 2.31.1
>


* Re: [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit
  2022-01-28 23:05   ` Dave Hansen
@ 2022-02-04 19:21     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-04 19:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 03:05:36PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Protection Keys for Supervisor pages (PKS) enables fast, hardware thread
> > specific, manipulation of permission restrictions on supervisor page
> 
> Nit: should be "hardware-thread-specific".
> 
> > mappings.  It uses the same mechanism of Protection Keys as those on
> > User mappings but applies that mechanism to supervisor mappings using a
> > supervisor specific MSR.
> 
> "supervisor-specific"
> 
> 	Memory Protection Keys (pkeys) provides a mechanism for
> 	enforcing page-based protections, but without requiring
> 	modification of the page tables when an application changes
> 	protection domains.
> 
> 	The kernel currently supports the pkeys for userspace (PKU)
> 	architecture.  That architecture has been extended to
> 	additionally support supervisor mappings.  The supervisor
> 	support is referred to as PKS.
> 
> I probably wouldn't mention the MSR unless you want to say:
> 
> 	The main difference between PKU and PKS is that PKS does not
> 	introduce any new instructions to write to its register.  The
> 	register is exposed as a normal MSR and is accessed with the
> 	normal MSR instructions.
> 
> 
> > The CPU indicates support for PKS in bit 31 of the ECX register after a
> > cpuid instruction.
> 
> I'd just remove this sentence.  We don't need to rehash each tiny morsel
> of the architecture in a commit message.

All done.  Thanks for the verbiage.
Ira



* Re: [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault
  2022-01-28 23:10   ` Dave Hansen
@ 2022-02-04 20:06     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-04 20:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 03:10:24PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Previously if a Protection key fault occurred it indicated something
> > very wrong because user page mappings are not supposed to be in the
> > kernel address space.
> 
> This is missing a key point.  The problem is PK faults on *kernel*
> addresses.

Ok, I'll try and clarify.

> 
> > Now PKey faults may happen on kernel mappings if the feature is enabled.
> 
> One nit: I've been using "pkeys" and "pkey" as the terms.  I usually
> don't capitalize them except at the beginning of a sentence.

I'll audit the series to use lower case for consistency.

> 
> > If PKS is enabled, avoid the warning in the fault path.
> > 
> > Cc: Sean Christopherson <seanjc@google.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  arch/x86/mm/fault.c | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index d0074c6ed31a..6ed91b632eac 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1148,11 +1148,15 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> >  		   unsigned long address)
> >  {
> >  	/*
> > -	 * Protection keys exceptions only happen on user pages.  We
> > -	 * have no user pages in the kernel portion of the address
> > -	 * space, so do not expect them here.
> > +	 * X86_PF_PK (Protection key exceptions) may occur on kernel addresses
> > +	 * when PKS (PKeys Supervisor) is enabled.
> > +	 *
> > +	 * However, if PKS is not enabled WARN if this exception is seen
> > +	 * because there are no user pages in the kernel portion of the address
> > +	 * space.
> >  	 */
> > -	WARN_ON_ONCE(hw_error_code & X86_PF_PK);
> > +	WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
> > +		     (hw_error_code & X86_PF_PK));
> >  
> >  #ifdef CONFIG_X86_32
> >  	/*
> 
> I'm wondering if this warning is even doing us any good.  I'm pretty
> sure it's never triggered on me at least.  Either way, let's not get too
> carried away with the comment.  I think this should do:
> 
> 	/*
> 	 * X86_PF_PK faults should only occur on kernel
> 	 * addresses when supervisor pkeys are enabled.
> 	 */

Sounds better,
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-01-27 17:55 ` [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages ira.weiny
@ 2022-02-04 21:07   ` Dan Williams
  2022-03-01 19:45     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 21:07 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Users of devmap pages should not have to know that the pages they are
> operating on are special.

How about getting straight to the point without any ambiguous references:

Today, kmap_{local_page,atomic} handles granting access to HIGHMEM
pages without the caller needing to know if the page is HIGHMEM, or
not. Use that existing infrastructure to grant access to PKS/PGMAP
access protected pages.

> Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
> pages via the devmap facility.  kmap_{local_page,atomic}() are both
> thread local mappings so they work well with the thread specific
> protections available.
>
> kmap(), on the other hand, allows for global mappings to be established,
> which is incompatible with the underlying PKS facility.

Why is kmap incompatible with PKS? I know why, but this is a claim
without evidence. If you documented that in a previous patch, there's
no harm in copying and pasting it into this one. A future git log user
will thank you for not making them go to lore to try to find the one
patch with the details.  Extra credit for creating a PKS theory of
operation document with this detail, unless I missed that?

> For this reason
> kmap() is not supported.  Rather than leave the kmap mappings to fault
> at random times when users may access them,

Is that a problem? This instrumentation is also insufficient for
legitimate usages of page_address(). Might as well rely on the kernel
developer community being able to debug PKS WARN() splats back to the
source because that will need to be done regardless, given kmap() is
not the only source of false positive access violations.

> call
> pgmap_protection_flag_invalid() to show kmap() users the call stack of
> where the mapping was created.  This allows better debugging.
>
> This behavior is safe because neither of the 2 current DAX-capable
> filesystems (ext4 and xfs) perform such global mappings.  And known
> device drivers that would handle devmap pages are not using kmap().  Any
> future filesystems that gain DAX support, or device drivers wanting to
> support devmap protected pages will need to use kmap_local_page().
>
> Direct-map exposure is already mitigated by default on HIGHMEM systems
> because by definition HIGHMEM systems do not have large capacities of
> memory in the direct map.  And using kmap in those systems actually
> creates a separate mapping.  Therefore, to reduce complexity HIGHMEM
> systems are not supported.

It was only at the end of this paragraph that I understood why I was
reading it. The change in topic was buried. I.e.:

---

Note: HIGHMEM support is mutually exclusive with PGMAP protection. The
rationale is mainly to reduce complexity, but also because direct-map
exposure is already mitigated by default on HIGHMEM systems because
by definition HIGHMEM systems do not have large capacities of memory
in the direct map...

---

That note and related change should probably go in the same patch that
introduces CONFIG_DEVMAP_ACCESS_PROTECTION in the first place. It's an
unrelated change to instrumenting kmap() to fail early, which again I
don't think is strictly necessary.

>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Reword commit message
> ---
>  include/linux/highmem-internal.h | 5 +++++
>  mm/Kconfig                       | 1 +
>  2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> index 0a0b2b09b1b8..1a006558734c 100644
> --- a/include/linux/highmem-internal.h
> +++ b/include/linux/highmem-internal.h
> @@ -159,6 +159,7 @@ static inline struct page *kmap_to_page(void *addr)
>  static inline void *kmap(struct page *page)
>  {
>         might_sleep();
> +       pgmap_protection_flag_invalid(page);
>         return page_address(page);
>  }
>
> @@ -174,6 +175,7 @@ static inline void kunmap(struct page *page)
>
>  static inline void *kmap_local_page(struct page *page)
>  {
> +       pgmap_mk_readwrite(page);
>         return page_address(page);
>  }
>
> @@ -197,6 +199,7 @@ static inline void __kunmap_local(void *addr)
>  #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
>         kunmap_flush_on_unmap(addr);
>  #endif
> +       pgmap_mk_noaccess(kmap_to_page(addr));
>  }
>
>  static inline void *kmap_atomic(struct page *page)
> @@ -206,6 +209,7 @@ static inline void *kmap_atomic(struct page *page)
>         else
>                 preempt_disable();
>         pagefault_disable();
> +       pgmap_mk_readwrite(page);
>         return page_address(page);
>  }
>
> @@ -224,6 +228,7 @@ static inline void __kunmap_atomic(void *addr)
>  #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
>         kunmap_flush_on_unmap(addr);
>  #endif
> +       pgmap_mk_noaccess(kmap_to_page(addr));
>         pagefault_enable();
>         if (IS_ENABLED(CONFIG_PREEMPT_RT))
>                 migrate_enable();
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 67e0264acf7d..d537679448ae 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -779,6 +779,7 @@ config ZONE_DEVICE
>  config DEVMAP_ACCESS_PROTECTION
>         bool "Access protection for memremap_pages()"
>         depends on NVDIMM_PFN
> +       depends on !HIGHMEM
>         depends on ARCH_HAS_SUPERVISOR_PKEYS
>         select ARCH_ENABLE_SUPERVISOR_PKEYS
>         default y
> --
> 2.31.1
>


* Re: [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection
  2022-01-27 17:55 ` [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection ira.weiny
@ 2022-02-04 21:10   ` Dan Williams
  2022-03-01 18:18     ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-04 21:10 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Now that all valid kernel accesses to PMEM have been annotated with
> {__}pgmap_mk_{readwrite,noaccess}() PGMAP_PROTECTION is safe to enable
> in the pmem layer.
>
> Implement the pmem_map_protected() and pmem_mk_{readwrite,noaccess}() to
> communicate this memory has extra protection to the upper layers if
> PGMAP_PROTECTION is specified.
>
> Internally, the pmem driver uses a cached virtual address,
> pmem->virt_addr (pmem_addr).  Use __pgmap_mk_{readwrite,noaccess}()
> directly when PGMAP_PROTECTION is active on the device.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Rebase to 5.17-rc1
>         Remove global param
>         Add internal structure which uses the pmem device and pgmap
>                 device directly in the *_mk_*() calls.
>         Add pmem dax ops callbacks
>         Use pgmap_protection_available()
>         s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
> ---
>  drivers/nvdimm/pmem.c | 52 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 58d95242a836..2afff8157233 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
>         return BLK_STS_OK;
>  }
>
> +static void __pmem_mk_readwrite(struct pmem_device *pmem)
> +{
> +       if (pmem->pgmap.flags & PGMAP_PROTECTION)
> +               __pgmap_mk_readwrite(&pmem->pgmap);
> +}
> +
> +static void __pmem_mk_noaccess(struct pmem_device *pmem)
> +{
> +       if (pmem->pgmap.flags & PGMAP_PROTECTION)
> +               __pgmap_mk_noaccess(&pmem->pgmap);
> +}
> +

Per previous feedback let's find a way for the pmem driver to stay out
of the loop, and just let these toggles be generic pgmap operations.


* Re: [PATCH V8 44/44] devdax: Enable stray access protection
  2022-01-27 17:55 ` [PATCH V8 44/44] devdax: " ira.weiny
@ 2022-02-04 21:12   ` Dan Williams
  0 siblings, 0 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-04 21:12 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
>
> From: Ira Weiny <ira.weiny@intel.com>
>
> Device dax is primarily accessed through user space and kernel access is
> controlled through the kmap interfaces.
>
> Now that all valid kernel-initiated accesses to dax devices have been
> accounted for, turn on PGMAP_PKEYS_PROTECT for device dax.
>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
>
> ---
> Changes for V8
>         Rebase to 5.17-rc1
>         Use pgmap_protection_available()
>         s/PGMAP_PKEYS_PROTECT/PGMAP_PROTECTION/
> ---
>  drivers/dax/device.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index d33a0613ed0c..cee375ef2cac 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -452,6 +452,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         if (dev_dax->align > PAGE_SIZE)
>                 pgmap->vmemmap_shift =
>                         order_base_2(dev_dax->align >> PAGE_SHIFT);
> +       if (pgmap_protection_available())
> +               pgmap->flags |= PGMAP_PROTECTION;

Looks good.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


* Re: [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values
  2022-01-29  0:02   ` Dave Hansen
@ 2022-02-04 23:54     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-04 23:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 04:02:05PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > +#define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
> > +			PKR_AD_KEY(1)	| \
> > +			PKR_AD_KEY(2)	| PKR_AD_KEY(3)		| \
> > +			PKR_AD_KEY(4)	| PKR_AD_KEY(5)		| \
> > +			PKR_AD_KEY(6)	| PKR_AD_KEY(7)		| \
> > +			PKR_AD_KEY(8)	| PKR_AD_KEY(9)		| \
> > +			PKR_AD_KEY(10)	| PKR_AD_KEY(11)	| \
> > +			PKR_AD_KEY(12)	| PKR_AD_KEY(13)	| \
> > +			PKR_AD_KEY(14)	| PKR_AD_KEY(15))
> 
> Considering how this is going to get used, let's just make this
> one-key-per-line:
> 
> #define PKS_INIT_VALUE (PKR_RW_KEY(PKS_KEY_DEFAULT)		| \
> 			PKR_AD_KEY(1)	| \
> 			PKR_AD_KEY(2)	| \
> 			PKR_AD_KEY(3)	| \
> 			...

Good idea, done.
Ira



* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-04 18:35   ` Dan Williams
@ 2022-02-05  0:09     ` Ira Weiny
  2022-02-05  0:19       ` Dan Williams
  2022-02-22 22:05     ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-05  0:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >

[snip]

I'll address the other comments later but wanted to address the idea below.

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index f5b2be39a78c..5020ed7e67b7 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1492,6 +1492,13 @@ struct task_struct {
> >         struct callback_head            l1d_flush_kill;
> >  #endif
> >
> > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > +       /*
> > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > +        * execution.  So it does not need to be atomic_t.
> > +        */
> > +       u32                             pgmap_prot_count;
> > +#endif
> 
> It's not at all clear why the task struct needs to be burdened with
> this accounting. Given that a devmap instance is needed to manage page
> protections, why not move the nested protection tracking to a percpu
> variable relative to an @pgmap arg? Something like:
> 
> void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> {
>        migrate_disable();
>        preempt_disable();

Why burden threads like this?  kmap_local_page() is perfectly able to migrate
or be preempted?

I think this is way too restrictive.

>        if (this_cpu_add_return(pgmap->pgmap_prot_count, 1) == 1)
>                pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
> }
> EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);
> 
> void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
> {
>        if (!this_cpu_sub_return(pgmap->pgmap_prot_count, 1))
>                pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
>        preempt_enable();
>        migrate_enable();
> }
> EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);
> 
> The naming, which I had a hand in, is not aging well. When I see "mk"
> I expect it to be building some value like a page table entry that
> will be installed later. These helpers are directly enabling and
> disabling access and are meant to be called symmetrically. So I would
> expect symmetric names like:
> 
> pgmap_enable_access()
> pgmap_disable_access()

Names are easily changed.  I'll look at changing the names.

Ira

> 
> 
> >         /*
> >          * New fields for task_struct should be added above here, so that
> >          * they are included in the randomized portion of task_struct.
> > diff --git a/init/init_task.c b/init/init_task.c
> > index 73cc8f03511a..948b32cf8139 100644
> > --- a/init/init_task.c
> > +++ b/init/init_task.c
> > @@ -209,6 +209,9 @@ struct task_struct init_task
> >  #ifdef CONFIG_SECCOMP_FILTER
> >         .seccomp        = { .filter_count = ATOMIC_INIT(0) },
> >  #endif
> > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > +       .pgmap_prot_count = 0,
> > +#endif
> >  };
> >  EXPORT_SYMBOL(init_task);
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index d3e6f328a711..b75c4f778c59 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -96,6 +96,20 @@ static void devmap_protection_disable(void)
> >         static_branch_dec(&dev_pgmap_protection_static_key);
> >  }
> >
> > +void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > +{
> > +       if (!current->pgmap_prot_count++)
> > +               pks_mk_readwrite(PKS_KEY_PGMAP_PROTECTION);
> > +}
> > +EXPORT_SYMBOL_GPL(__pgmap_mk_readwrite);
> > +
> > +void __pgmap_mk_noaccess(struct dev_pagemap *pgmap)
> > +{
> > +       if (!--current->pgmap_prot_count)
> > +               pks_mk_noaccess(PKS_KEY_PGMAP_PROTECTION);
> > +}
> > +EXPORT_SYMBOL_GPL(__pgmap_mk_noaccess);
> > +
> >  bool pgmap_protection_available(void)
> >  {
> >         return pks_available();
> > --
> > 2.31.1
> >


* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-05  0:09     ` Ira Weiny
@ 2022-02-05  0:19       ` Dan Williams
  2022-02-05  0:25         ` Dan Williams
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-05  0:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 4, 2022 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > >
>
> [snip]
>
> I'll address the other comments later but wanted to address the idea below.
>
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index f5b2be39a78c..5020ed7e67b7 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1492,6 +1492,13 @@ struct task_struct {
> > >         struct callback_head            l1d_flush_kill;
> > >  #endif
> > >
> > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > +       /*
> > > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > > +        * execution.  So it does not need to be atomic_t.
> > > +        */
> > > +       u32                             pgmap_prot_count;
> > > +#endif
> >
> > It's not at all clear why the task struct needs to be burdened with
> > this accounting. Given that a devmap instance is needed to manage page
> > protections, why not move the nested protection tracking to a percpu
> > variable relative to an @pgmap arg? Something like:
> >
> > void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > {
> >        migrate_disable();
> >        preempt_disable();
>
> Why burden threads like this?  kmap_local_page() is perfectly able to migrate
> or be preempted?
>
> I think this is way too restrictive.

kmap_local_page() holds migrate_disable() over the entire mapping, so
we're only talking about preempt_disable(). I tend to think that
bloating task_struct for something that is rarely used "kmap on dax
pmem pages" is not the right tradeoff.


* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-05  0:19       ` Dan Williams
@ 2022-02-05  0:25         ` Dan Williams
  2022-02-05  0:27           ` Dan Williams
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-05  0:25 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 4, 2022 at 4:19 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Feb 4, 2022 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > > >
> >
> > [snip]
> >
> > I'll address the other comments later but wanted to address the idea below.
> >
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index f5b2be39a78c..5020ed7e67b7 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1492,6 +1492,13 @@ struct task_struct {
> > > >         struct callback_head            l1d_flush_kill;
> > > >  #endif
> > > >
> > > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > > +       /*
> > > > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > > > +        * execution.  So it does not need to be atomic_t.
> > > > +        */
> > > > +       u32                             pgmap_prot_count;
> > > > +#endif
> > >
> > > It's not at all clear why the task struct needs to be burdened with
> > > this accounting. Given that a devmap instance is needed to manage page
> > > protections, why not move the nested protection tracking to a percpu
> > > variable relative to an @pgmap arg? Something like:
> > >
> > > void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > > {
> > >        migrate_disable();
> > >        preempt_disable();
> >
> > Why burden threads like this?  kmap_local_page() is perfectly able to migrate
> > or be preempted?
> >
> > I think this is way too restrictive.
>
> kmap_local_page() holds migrate_disable() over the entire mapping, so
> we're only talking about preempt_disable(). I tend to think that
> bloating task_struct for something that is rarely used "kmap on dax
> pmem pages" is not the right tradeoff.

Now, I can see an argument that promoting kmap_local_page() to
preempt_disable() could cause problems, but I'd like help confirming
that before committing to extending task_struct.


* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-05  0:25         ` Dan Williams
@ 2022-02-05  0:27           ` Dan Williams
  2022-02-05  5:55             ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-05  0:27 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 4, 2022 at 4:25 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Feb 4, 2022 at 4:19 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Feb 4, 2022 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > >
> > > On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> > > > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > > > >
> > >
> > > [snip]
> > >
> > > I'll address the other comments later but wanted to address the idea below.
> > >
> > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > > index f5b2be39a78c..5020ed7e67b7 100644
> > > > > --- a/include/linux/sched.h
> > > > > +++ b/include/linux/sched.h
> > > > > @@ -1492,6 +1492,13 @@ struct task_struct {
> > > > >         struct callback_head            l1d_flush_kill;
> > > > >  #endif
> > > > >
> > > > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > > > +       /*
> > > > > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > > > > +        * execution.  So it does not need to be atomic_t.
> > > > > +        */
> > > > > +       u32                             pgmap_prot_count;
> > > > > +#endif
> > > >
> > > > It's not at all clear why the task struct needs to be burdened with
> > > > this accounting. Given that a devmap instance is needed to manage page
> > > > protections, why not move the nested protection tracking to a percpu
> > > > variable relative to an @pgmap arg? Something like:
> > > >
> > > > void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > > > {
> > > >        migrate_disable();
> > > >        preempt_disable();
> > >
> > > Why burden threads like this?  kmap_local_page() is perfectly able to migrate
> > > or be preempted?
> > >
> > > I think this is way too restrictive.
> >
> > kmap_local_page() holds migrate_disable() over the entire mapping, so
> > we're only talking about preempt_disable(). I tend to think that
> > bloating task_struct for something that is rarely used "kmap on dax
> > pmem pages" is not the right tradeoff.
>
> Now, I can see an argument that promoting kmap_local_page() to
> preempt_disable() could cause problems, but I'd like help confirming
> that before committing to extending task_struct.

...as I say that, it occurs to me that the whole point of
kmap_local_page() is to be better than kmap_atomic() and this undoes
that. I'd at least like that documented as the reason that task_struct
needs to carry a new field.


* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-04 17:12     ` Dan Williams
@ 2022-02-05  5:40       ` Ira Weiny
  2022-02-05  8:19         ` Dan Williams
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-05  5:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Fri, Feb 04, 2022 at 09:12:11AM -0800, Dan Williams wrote:
> On Tue, Feb 1, 2022 at 10:35 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > >  enum pks_pkey_consumers {
> > > -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> > > values */
> > > -       PKS_KEY_TEST            = 1,
> > > -       PKS_KEY_NR_CONSUMERS    = 2,
> > > +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> > > PTE values */
> > > +       PKS_KEY_TEST                    = 1,
> > > +       PKS_KEY_PGMAP_PROTECTION        = 2,
> > > +       PKS_KEY_NR_CONSUMERS            = 3,
> > >  };
> >
> > The c spec says that any enum member that doesn't have an "=" will be
> > one more than the previous member. As a consequence you can leave the
> > "=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
> > more like this.
> >
> > I know we've gone around and around on this, but why also specify the
> > value for each key? They should auto increment and the first one is
> > guaranteed to be zero.

Because it was easier to ensure that the init value had all the defaults
covered.

> >
> > Otherwise this doesn't use any of the features of "enum", it's just a
> > verbose series of const int's.

True but does this really matter?

> 
> Going further, this can also build in support for dynamically (at
> build time) freeing keys based on config, something like:
> 
> enum {
> #if IS_ENABLED(CONFIG_PKS_TEST)
> PKS_KEY_TEST,
> #endif
> #if IS_ENABLED(CONFIG_DEVMAP_PROTECTION)
> PKS_KEY_PGMAP_PROTECTION,
> #endif
> PKS_KEY_NR_CONSUMERS,
> }

This is all well and good until you get to the point of trying to define the
initial MSR value.

What Rick proposes without the Kconfig check is easier than this.  But to do
what both you and Rick suggest, this is the best crap I've been able to come up
with that actually works...


/* pkeys_common.h */
#define PKR_AD_BIT 0x1u
#define PKR_WD_BIT 0x2u
#define PKR_BITS_PER_PKEY 2

#define PKR_PKEY_SHIFT(pkey)    (pkey * PKR_BITS_PER_PKEY)

#define PKR_KEY_INIT_RW(pkey)   (0          << PKR_PKEY_SHIFT(pkey))
#define PKR_KEY_INIT_AD(pkey)   (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
#define PKR_KEY_INIT_WD(pkey)   (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))


/* pks-keys.h */
#define PKR_KEY_MASK(pkey)   (0xffffffff & ~((PKR_WD_BIT|PKR_AD_BIT) << PKR_PKEY_SHIFT(pkey)))

enum pks_pkey_consumers {
        PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default PTE values */
#if IS_ENABLED(CONFIG_PKS_TEST)
        PKS_KEY_TEST,
#endif
#if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
        PKS_KEY_PGMAP_PROTECTION,
#endif
        PKS_KEY_NR_CONSUMERS
};

#define PKS_DEFAULT_VALUE PKR_KEY_INIT_RW(PKS_KEY_DEFAULT)
#define PKS_DEFAULT_MASK  PKR_KEY_MASK(PKS_KEY_DEFAULT)

#if IS_ENABLED(CONFIG_PKS_TEST)
#define PKS_TEST_VALUE PKR_KEY_INIT_AD(PKS_KEY_TEST)
#define PKS_TEST_MASK  PKR_KEY_MASK(PKS_KEY_TEST)
#else
/* Just define another default value to fool the CPP */
#define PKS_TEST_VALUE PKR_KEY_INIT_RW(0)
#define PKS_TEST_MASK  PKR_KEY_MASK(0)
#endif

#if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
#define PKS_PGMAP_VALUE PKR_KEY_INIT_AD(PKS_KEY_PGMAP_PROTECTION)
#define PKS_PGMAP_MASK  PKR_KEY_MASK(PKS_KEY_PGMAP_PROTECTION)
#else
/* Just define another default value to fool the CPP */
#define PKS_PGMAP_VALUE PKR_KEY_INIT_RW(0)
#define PKS_PGMAP_MASK  PKR_KEY_MASK(0)
#endif

#define PKS_INIT_VALUE ((0xFFFFFFFF & \
                        (PKS_DEFAULT_MASK & \
                                PKS_TEST_MASK & \
                                PKS_PGMAP_MASK \
                        )) | \
                        (PKS_DEFAULT_VALUE | \
                        PKS_TEST_VALUE | \
                        PKS_PGMAP_VALUE \
                        ) \
                        )


I find the above much harder to parse and of little value.  I'm pretty sure
that someone adding a key is much more likely to get the macro maze wrong.
Reviewing a patch to add a key would be much more difficult as well, IMHO.

I'm a bit tired of this back and forth trying to implement this for features
which may never exist.  In all my discussions I don't think we have reached
more than 10 use cases in our wildest dreams and only 4 have even been
attempted with real code, including the PKS test.

So I'm just a bit frustrated with the effort we have put in to try and do
dynamic or even compile time dynamic keys.

Anyway I'll think on it more.  But I'm inclined to leave it alone right now.
What I have is easy to review for correctness and only takes a bit of effort to
actually use.

Ira


* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-05  0:27           ` Dan Williams
@ 2022-02-05  5:55             ` Ira Weiny
  2022-02-05  6:28               ` Dan Williams
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-05  5:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 04:27:38PM -0800, Dan Williams wrote:
> On Fri, Feb 4, 2022 at 4:25 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Feb 4, 2022 at 4:19 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > > >
> > > > On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> > > > > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > > > > >
> > > >
> > > > [snip]
> > > >
> > > > I'll address the other comments later but wanted to address the idea below.
> > > >
> > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > > > index f5b2be39a78c..5020ed7e67b7 100644
> > > > > > --- a/include/linux/sched.h
> > > > > > +++ b/include/linux/sched.h
> > > > > > @@ -1492,6 +1492,13 @@ struct task_struct {
> > > > > >         struct callback_head            l1d_flush_kill;
> > > > > >  #endif
> > > > > >
> > > > > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > > > > +       /*
> > > > > > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > > > > > +        * execution.  So it does not need to be atomic_t.
> > > > > > +        */
> > > > > > +       u32                             pgmap_prot_count;
> > > > > > +#endif
> > > > >
> > > > > It's not at all clear why the task struct needs to be burdened with
> > > > > this accounting. Given that a devmap instance is needed to manage page
> > > > > protections, why not move the nested protection tracking to a percpu
> > > > > variable relative to an @pgmap arg? Something like:
> > > > >
> > > > > void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > > > > {
> > > > >        migrate_disable();
> > > > >        preempt_disable();
> > > >
> > > > Why burden threads like this?  kmap_local_page() is perfectly able to migrate
> > > > or be preempted?
> > > >
> > > > I think this is way too restrictive.
> > >
> > > kmap_local_page() holds migrate_disable() over the entire mapping, so
> > > we're only talking about preempt_disable(). I tend to think that
> > > bloating task_struct for something that is rarely used "kmap on dax
> > > pmem pages" is not the right tradeoff.
> >
> > Now, I can see an argument that promoting kmap_local_page() to
> > preempt_disable() could cause problems, but I'd like help confirming
> > that before committing to extending task_struct.
> 
> ...as I say that it occurs to me that the whole point of
> kmap_local_page() is to be better than kmap_atomic() and this undoes
> that. I'd at least like that documented as the reason that task_struct
> needs to carry a new field.

I'll try and update the commit message, but kmap_local_page() only disables
migration on a highmem system.  The devmap/PKS use case is specifically not
supported on highmem systems.  Mainly because on highmem systems
kmap_local_page() actually creates a new mapping which is not covered by PKS
anyway.

So for the devmap/PKS use case kmap_local_page() is defined as:

 static inline void *kmap_local_page(struct page *page)
 {
+       pgmap_mk_readwrite(page);
        return page_address(page);
 }

...for the linear mapping.  I'll try and update the commit message with this
detail.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-05  5:55             ` Ira Weiny
@ 2022-02-05  6:28               ` Dan Williams
  0 siblings, 0 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-05  6:28 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 4, 2022 at 9:56 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Fri, Feb 04, 2022 at 04:27:38PM -0800, Dan Williams wrote:
> > On Fri, Feb 4, 2022 at 4:25 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Fri, Feb 4, 2022 at 4:19 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > On Fri, Feb 4, 2022 at 4:10 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > > > >
> > > > > On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> > > > > > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > > > > > >
> > > > >
> > > > > [snip]
> > > > >
> > > > > I'll address the other comments later but wanted to address the idea below.
> > > > >
> > > > > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > > > > index f5b2be39a78c..5020ed7e67b7 100644
> > > > > > > --- a/include/linux/sched.h
> > > > > > > +++ b/include/linux/sched.h
> > > > > > > @@ -1492,6 +1492,13 @@ struct task_struct {
> > > > > > >         struct callback_head            l1d_flush_kill;
> > > > > > >  #endif
> > > > > > >
> > > > > > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > > > > > +       /*
> > > > > > > +        * NOTE: pgmap_prot_count is modified within a single thread of
> > > > > > > +        * execution.  So it does not need to be atomic_t.
> > > > > > > +        */
> > > > > > > +       u32                             pgmap_prot_count;
> > > > > > > +#endif
> > > > > >
> > > > > > It's not at all clear why the task struct needs to be burdened with
> > > > > > this accounting. Given that a devmap instance is needed to manage page
> > > > > > protections, why not move the nested protection tracking to a percpu
> > > > > > variable relative to an @pgmap arg? Something like:
> > > > > >
> > > > > > void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> > > > > > {
> > > > > >        migrate_disable();
> > > > > >        preempt_disable();
> > > > >
> > > > > Why burden threads like this?  kmap_local_page() is perfectly able to migrate
> > > > > or be preempted?
> > > > >
> > > > > I think this is way too restrictive.
> > > >
> > > > kmap_local_page() holds migrate_disable() over the entire mapping, so
> > > > we're only talking about preempt_disable(). I tend to think that
> > > > bloating task_struct for something that is rarely used "kmap on dax
> > > > pmem pages" is not the right tradeoff.
> > >
> > > Now, I can see an argument that promoting kmap_local_page() to
> > > preempt_disable() could cause problems, but I'd like help confirming
> > > that before committing to extending task_struct.
> >
> > ...as I say that it occurs to me that the whole point of
> > kmap_local_page() is to be better than kmap_atomic() and this undoes
> > that. I'd at least like that documented as the reason that task_struct
> > needs to carry a new field.
>
> I'll try and update the commit message but kmap_local_page() only disables
> migrate on a highmem system.

Right, but that still means that the code in question is prepared for
migrate_disable(). Instead I would justify the task_struct expansion
on the observation that moving the enable tracker to percpu would
require pre-empt disable and promote kmap_local_page() to
kmap_atomic() level of restriction. Otherwise if there was a way to
avoid task_struct expansion with only migrate_disable() and not
preempt_disable I think it would be worth it, but nothing comes to
mind at the moment.

> The devmap/PKS use case is specifically not
> supported on highmem systems.  Mainly because on highmem systems
> kmap_local_page() actually creates a new mapping which is not covered by PKS
> anyway.
>
> So for the devmap/PKS use case kmap_local_page() is defined as:
>
>  static inline void *kmap_local_page(struct page *page)
>  {
> +       pgmap_mk_readwrite(page);
>         return page_address(page);
>  }
>
> ...for the linear mapping.  I'll try and update the commit message with this
> detail.

Yeah, add the kmap_atomic() observation, not the !highmem note.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-05  5:40       ` Ira Weiny
@ 2022-02-05  8:19         ` Dan Williams
  2022-02-06 18:14           ` Dan Williams
  2022-02-08 22:48           ` Ira Weiny
  0 siblings, 2 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-05  8:19 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Fri, Feb 4, 2022 at 9:40 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Fri, Feb 04, 2022 at 09:12:11AM -0800, Dan Williams wrote:
> > On Tue, Feb 1, 2022 at 10:35 AM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > > >  enum pks_pkey_consumers {
> > > > -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> > > > values */
> > > > -       PKS_KEY_TEST            = 1,
> > > > -       PKS_KEY_NR_CONSUMERS    = 2,
> > > > +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> > > > PTE values */
> > > > +       PKS_KEY_TEST                    = 1,
> > > > +       PKS_KEY_PGMAP_PROTECTION        = 2,
> > > > +       PKS_KEY_NR_CONSUMERS            = 3,
> > > >  };
> > >
> > > The c spec says that any enum member that doesn't have an "=" will be
> > > one more than the previous member. As a consequence you can leave the
> > > "=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
> > > more like this.
> > >
> > > I know we've gone around and around on this, but why also specify the
> > > value for each key? They should auto increment and the first one is
> > > guaranteed to be zero.
>
> Because it was easier to ensure that the init value had all the defaults
> covered.
>
> > >
> > > Otherwise this doesn't use any of the features of "enum", it's just a
> > > verbose series of const int's.
>
> True but does this really matter?
>
> >
> > Going further, this can also build in support for dynamically (at
> > build time) freeing keys based on config, something like:
> >
> > enum {
> > #if IS_ENABLED(CONFIG_PKS_TEST)
> > PKS_KEY_TEST,
> > #endif
> > #if IS_ENABLED(CONFIG_DEVMAP_PROTECTION)
> > PKS_KEY_PGMAP_PROTECTION,
> > #endif
> > PKS_KEY_NR_CONSUMERS,
> > }
>
> This is all well and good until you get to the point of trying to define the
> initial MSR value.
>
> What Rick proposes without the Kconfig check is easier than this.  But to do
> what both you and Rick suggest this is the best crap I've been able to come up
> with that actually works...
>
>
> /* pkeys_common.h */
> #define PKR_AD_BIT 0x1u
> #define PKR_WD_BIT 0x2u
> #define PKR_BITS_PER_PKEY 2
>
> #define PKR_PKEY_SHIFT(pkey)    (pkey * PKR_BITS_PER_PKEY)
>
> #define PKR_KEY_INIT_RW(pkey)   (0          << PKR_PKEY_SHIFT(pkey))
> #define PKR_KEY_INIT_AD(pkey)   (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
> #define PKR_KEY_INIT_WD(pkey)   (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
>
>
> /* pks-keys.h */
> #define PKR_KEY_MASK(pkey)   (0xffffffff & ~((PKR_WD_BIT|PKR_AD_BIT) << PKR_PKEY_SHIFT(pkey)))
>
> enum pks_pkey_consumers {
>         PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default PTE values */
> #if IS_ENABLED(CONFIG_PKS_TEST)
>         PKS_KEY_TEST,
> #endif
> #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
>         PKS_KEY_PGMAP_PROTECTION,
> #endif
>         PKS_KEY_NR_CONSUMERS
> };
>
> #define PKS_DEFAULT_VALUE PKR_KEY_INIT_RW(PKS_KEY_DEFAULT)
> #define PKS_DEFAULT_MASK  PKR_KEY_MASK(PKS_KEY_DEFAULT)
>
> #if IS_ENABLED(CONFIG_PKS_TEST)
> #define PKS_TEST_VALUE PKR_KEY_INIT_AD(PKS_KEY_TEST)
> #define PKS_TEST_MASK  PKR_KEY_MASK(PKS_KEY_TEST)
> #else
> /* Just define another default value to fool the CPP */
> #define PKS_TEST_VALUE PKR_KEY_INIT_RW(0)
> #define PKS_TEST_MASK  PKR_KEY_MASK(0)
> #endif
>
> #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
> #define PKS_PGMAP_VALUE PKR_KEY_INIT_AD(PKS_KEY_PGMAP_PROTECTION)
> #define PKS_PGMAP_MASK  PKR_KEY_MASK(PKS_KEY_PGMAP_PROTECTION)
> #else
> /* Just define another default value to fool the CPP */
> #define PKS_PGMAP_VALUE PKR_KEY_INIT_RW(0)
> #define PKS_PGMAP_MASK  PKR_KEY_MASK(0)
> #endif
>
> #define PKS_INIT_VALUE ((0xFFFFFFFF & \
>                         (PKS_DEFAULT_MASK & \
>                                 PKS_TEST_MASK & \
>                                 PKS_PGMAP_MASK \
>                         )) | \
>                         (PKS_DEFAULT_VALUE | \
>                         PKS_TEST_VALUE | \
>                         PKS_PGMAP_VALUE \
>                         ) \
>                         )
>
>
> I find the above much harder to parse and of little value.  I'm pretty sure
> that someone adding a key is much more likely to get the macro maze wrong.
> Reviewing a patch to add a key would be much more difficult as well, IMHO.
>
> I'm a bit tired of this back and forth trying to implement this for features
> which may never exist.  In all my discussions I don't think we have reached
> more than 10 use cases in our wildest dreams and only 4 have even been
> attempted with real code including the PKS test.
>
> So I'm just a bit frustrated with the effort we have put in to try and do
> dynamic or even compile time dynamic keys.
>
> Anyway I'll think on it more.  But I'm inclined to leave it alone right now.
> What I have is easy to review for correctness and only takes a bit of effort to
> actually use.

Sorry for the thrash, Ira. I felt compelled to put some skin in the
game and came up with the below. Ignore the names; they are just to
show the idea:

#define KEYVAL(prev, config) (prev + __is_defined(config))
#define KEYDEFAULT(key, default, config) ((default << (key * 2)) * __is_defined(config))
#define KEYX KEYVAL(0, ENABLE_X)
#define KEYY KEYVAL(KEYX, ENABLE_Y)
#define KEYZ KEYVAL(KEYY, ENABLE_Z)
#define KEYMAX KEYVAL(KEYZ, 1)

#define KEY0_INIT 0x1
#define KEYX_INIT KEYDEFAULT(KEYX, 2, ENABLE_X)
#define KEYY_INIT KEYDEFAULT(KEYY, 2, ENABLE_Y)
#define KEYZ_INIT KEYDEFAULT(KEYZ, 2, ENABLE_Z)

#define ALL_AD 0x55555555
#define DEFAULT (ALL_AD & (GENMASK(31, KEYMAX * 2))) | KEYX_INIT | KEYY_INIT | KEYZ_INIT | KEY0_INIT

The idea is that this relies on a defined key order, but still
dynamically assigns key values at compile time. It's not as simple as
an enum, but it still seems readable to me.

The definition is done in 2 parts: key slot numbers (KEYVAL()) and the
values they need to inject into the combined global init mask
(KEYDEFAULT()). The KEYVAL() section above defines the key order where
each key gets an incremented value in the order, but only if the
previous key was defined. KEYDEFAULT() defines the init value and
optionally zeros out the init value if the config is disabled. Finally
the DEFAULT mask is initialized to all 5s if there are zero keys
defined, but otherwise masks 2-bits per defined key + 1 and ORs in the
corresponding key init values. This uses some of the magic behind the
IS_ENABLED() macro to turn undefined symbols into 0s.

It seems to work, for example if I just define KEYX and KEYZ,

#define ENABLE_X 1
#define ENABLE_Z 1

        printf("X: %d:%#x Y: %d:%#x Z: %d:%#x MAX: %d DEFAULT: %#x\n", KEYX,
               KEYX_INIT, KEYY, KEYY_INIT, KEYZ, KEYZ_INIT, KEYMAX, DEFAULT);

# ./a.out
X: 1:0x8 Y: 1:0 Z: 2:0x20 MAX: 3 DEFAULT: 0x55555569

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-05  8:19         ` Dan Williams
@ 2022-02-06 18:14           ` Dan Williams
  2022-02-08 22:48           ` Ira Weiny
  1 sibling, 0 replies; 145+ messages in thread
From: Dan Williams @ 2022-02-06 18:14 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Sat, Feb 5, 2022 at 12:19 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Fri, Feb 4, 2022 at 9:40 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Fri, Feb 04, 2022 at 09:12:11AM -0800, Dan Williams wrote:
> > > On Tue, Feb 1, 2022 at 10:35 AM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > >
> > > > On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > > > >  enum pks_pkey_consumers {
> > > > > -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> > > > > values */
> > > > > -       PKS_KEY_TEST            = 1,
> > > > > -       PKS_KEY_NR_CONSUMERS    = 2,
> > > > > +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> > > > > PTE values */
> > > > > +       PKS_KEY_TEST                    = 1,
> > > > > +       PKS_KEY_PGMAP_PROTECTION        = 2,
> > > > > +       PKS_KEY_NR_CONSUMERS            = 3,
> > > > >  };
> > > >
> > > > The c spec says that any enum member that doesn't have an "=" will be
> > > > one more than the previous member. As a consequence you can leave the
> > > > "=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
> > > > more like this.
> > > >
> > > > I know we've gone around and around on this, but why also specify the
> > > > value for each key? They should auto increment and the first one is
> > > > guaranteed to be zero.
> >
> > Because it was easier to ensure that the init value had all the defaults
> > covered.
> >
> > > >
> > > > Otherwise this doesn't use any of the features of "enum", it's just a
> > > > verbose series of const int's.
> >
> > True but does this really matter?
> >
> > >
> > > Going further, this can also build in support for dynamically (at
> > > build time) freeing keys based on config, something like:
> > >
> > > enum {
> > > #if IS_ENABLED(CONFIG_PKS_TEST)
> > > PKS_KEY_TEST,
> > > #endif
> > > #if IS_ENABLED(CONFIG_DEVMAP_PROTECTION)
> > > PKS_KEY_PGMAP_PROTECTION,
> > > #endif
> > > PKS_KEY_NR_CONSUMERS,
> > > }
> >
> > This is all well and good until you get to the point of trying to define the
> > initial MSR value.
> >
> > What Rick proposes without the Kconfig check is easier than this.  But to do
> > what both you and Rick suggest this is the best crap I've been able to come up
> > with that actually works...
> >
> >
> > /* pkeys_common.h */
> > #define PKR_AD_BIT 0x1u
> > #define PKR_WD_BIT 0x2u
> > #define PKR_BITS_PER_PKEY 2
> >
> > #define PKR_PKEY_SHIFT(pkey)    (pkey * PKR_BITS_PER_PKEY)
> >
> > #define PKR_KEY_INIT_RW(pkey)   (0          << PKR_PKEY_SHIFT(pkey))
> > #define PKR_KEY_INIT_AD(pkey)   (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
> > #define PKR_KEY_INIT_WD(pkey)   (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
> >
> >
> > /* pks-keys.h */
> > #define PKR_KEY_MASK(pkey)   (0xffffffff & ~((PKR_WD_BIT|PKR_AD_BIT) << PKR_PKEY_SHIFT(pkey)))
> >
> > enum pks_pkey_consumers {
> >         PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default PTE values */
> > #if IS_ENABLED(CONFIG_PKS_TEST)
> >         PKS_KEY_TEST,
> > #endif
> > #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
> >         PKS_KEY_PGMAP_PROTECTION,
> > #endif
> >         PKS_KEY_NR_CONSUMERS
> > };
> >
> > #define PKS_DEFAULT_VALUE PKR_KEY_INIT_RW(PKS_KEY_DEFAULT)
> > #define PKS_DEFAULT_MASK  PKR_KEY_MASK(PKS_KEY_DEFAULT)
> >
> > #if IS_ENABLED(CONFIG_PKS_TEST)
> > #define PKS_TEST_VALUE PKR_KEY_INIT_AD(PKS_KEY_TEST)
> > #define PKS_TEST_MASK  PKR_KEY_MASK(PKS_KEY_TEST)
> > #else
> > /* Just define another default value to fool the CPP */
> > #define PKS_TEST_VALUE PKR_KEY_INIT_RW(0)
> > #define PKS_TEST_MASK  PKR_KEY_MASK(0)
> > #endif
> >
> > #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
> > #define PKS_PGMAP_VALUE PKR_KEY_INIT_AD(PKS_KEY_PGMAP_PROTECTION)
> > #define PKS_PGMAP_MASK  PKR_KEY_MASK(PKS_KEY_PGMAP_PROTECTION)
> > #else
> > /* Just define another default value to fool the CPP */
> > #define PKS_PGMAP_VALUE PKR_KEY_INIT_RW(0)
> > #define PKS_PGMAP_MASK  PKR_KEY_MASK(0)
> > #endif
> >
> > #define PKS_INIT_VALUE ((0xFFFFFFFF & \
> >                         (PKS_DEFAULT_MASK & \
> >                                 PKS_TEST_MASK & \
> >                                 PKS_PGMAP_MASK \
> >                         )) | \
> >                         (PKS_DEFAULT_VALUE | \
> >                         PKS_TEST_VALUE | \
> >                         PKS_PGMAP_VALUE \
> >                         ) \
> >                         )
> >
> >
> > I find the above much harder to parse and of little value.  I'm pretty sure
> > that someone adding a key is much more likely to get the macro maze wrong.
> > Reviewing a patch to add a key would be much more difficult as well, IMHO.
> >
> > I'm a bit tired of this back and forth trying to implement this for features
> > which may never exist.  In all my discussions I don't think we have reached
> > more than 10 use cases in our wildest dreams and only 4 have even been
> > attempted with real code including the PKS test.
> >
> > So I'm just a bit frustrated with the effort we have put in to try and do
> > dynamic or even compile time dynamic keys.
> >
> > Anyway I'll think on it more.  But I'm inclined to leave it alone right now.
> > What I have is easy to review for correctness and only takes a bit of effort to
> > actually use.
>
> Sorry for the thrash, Ira. I felt compelled to put some skin in the
> game and came up with the below. Ignore the names; they are just to
> show the idea:

I should add that the reason that I think this is important is to
allow key scaling *within* use cases. For example, why should one
thread get access to all PMEM in a kmap_local_page() section? A simple
extension would be to use several keys, hashed by namespace, to give
finer grained protection. Another idea that scales with keys is to use
PKS faults to sample accesses by memory type or namespace. In other
words, a handful of use cases can expand to exhaust all the keys for
finer-grained access control / sampling.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-05  8:19         ` Dan Williams
  2022-02-06 18:14           ` Dan Williams
@ 2022-02-08 22:48           ` Ira Weiny
  2022-02-08 23:22             ` Dan Williams
  1 sibling, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-08 22:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Sat, Feb 05, 2022 at 12:19:27AM -0800, Dan Williams wrote:
> On Fri, Feb 4, 2022 at 9:40 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Fri, Feb 04, 2022 at 09:12:11AM -0800, Dan Williams wrote:
> > > On Tue, Feb 1, 2022 at 10:35 AM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > >
> > > > On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > > > >  enum pks_pkey_consumers {
> > > > > -       PKS_KEY_DEFAULT         = 0, /* Must be 0 for default PTE
> > > > > values */
> > > > > -       PKS_KEY_TEST            = 1,
> > > > > -       PKS_KEY_NR_CONSUMERS    = 2,
> > > > > +       PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default
> > > > > PTE values */
> > > > > +       PKS_KEY_TEST                    = 1,
> > > > > +       PKS_KEY_PGMAP_PROTECTION        = 2,
> > > > > +       PKS_KEY_NR_CONSUMERS            = 3,
> > > > >  };
> > > >
> > > > The c spec says that any enum member that doesn't have an "=" will be
> > > > one more than the previous member. As a consequence you can leave the
> > > > "=" off PKS_KEY_NR_CONSUMERS and it will get auto adjusted when you add
> > > > more like this.
> > > >
> > > > I know we've gone around and around on this, but why also specify the
> > > > value for each key? They should auto increment and the first one is
> > > > guaranteed to be zero.
> >
> > Because it was easier to ensure that the init value had all the defaults
> > covered.
> >
> > > >
> > > > Otherwise this doesn't use any of the features of "enum", it's just a
> > > > verbose series of const int's.
> >
> > True but does this really matter?
> >
> > >
> > > Going further, this can also build in support for dynamically (at
> > > build time) freeing keys based on config, something like:
> > >
> > > enum {
> > > #if IS_ENABLED(CONFIG_PKS_TEST)
> > > PKS_KEY_TEST,
> > > #endif
> > > #if IS_ENABLED(CONFIG_DEVMAP_PROTECTION)
> > > PKS_KEY_PGMAP_PROTECTION,
> > > #endif
> > > PKS_KEY_NR_CONSUMERS,
> > > }
> >
> > This is all well and good until you get to the point of trying to define the
> > initial MSR value.
> >
> > What Rick proposes without the Kconfig check is easier than this.  But to do
> > what both you and Rick suggest this is the best crap I've been able to come up
> > with that actually works...
> >
> >
> > /* pkeys_common.h */
> > #define PKR_AD_BIT 0x1u
> > #define PKR_WD_BIT 0x2u
> > #define PKR_BITS_PER_PKEY 2
> >
> > #define PKR_PKEY_SHIFT(pkey)    (pkey * PKR_BITS_PER_PKEY)
> >
> > #define PKR_KEY_INIT_RW(pkey)   (0          << PKR_PKEY_SHIFT(pkey))
> > #define PKR_KEY_INIT_AD(pkey)   (PKR_AD_BIT << PKR_PKEY_SHIFT(pkey))
> > #define PKR_KEY_INIT_WD(pkey)   (PKR_WD_BIT << PKR_PKEY_SHIFT(pkey))
> >
> >
> > /* pks-keys.h */
> > #define PKR_KEY_MASK(pkey)   (0xffffffff & ~((PKR_WD_BIT|PKR_AD_BIT) << PKR_PKEY_SHIFT(pkey)))
> >
> > enum pks_pkey_consumers {
> >         PKS_KEY_DEFAULT                 = 0, /* Must be 0 for default PTE values */
> > #if IS_ENABLED(CONFIG_PKS_TEST)
> >         PKS_KEY_TEST,
> > #endif
> > #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
> >         PKS_KEY_PGMAP_PROTECTION,
> > #endif
> >         PKS_KEY_NR_CONSUMERS
> > };
> >
> > #define PKS_DEFAULT_VALUE PKR_KEY_INIT_RW(PKS_KEY_DEFAULT)
> > #define PKS_DEFAULT_MASK  PKR_KEY_MASK(PKS_KEY_DEFAULT)
> >
> > #if IS_ENABLED(CONFIG_PKS_TEST)
> > #define PKS_TEST_VALUE PKR_KEY_INIT_AD(PKS_KEY_TEST)
> > #define PKS_TEST_MASK  PKR_KEY_MASK(PKS_KEY_TEST)
> > #else
> > /* Just define another default value to fool the CPP */
> > #define PKS_TEST_VALUE PKR_KEY_INIT_RW(0)
> > #define PKS_TEST_MASK  PKR_KEY_MASK(0)
> > #endif
> >
> > #if IS_ENABLED(CONFIG_DEVMAP_ACCESS_PROTECTION)
> > #define PKS_PGMAP_VALUE PKR_KEY_INIT_AD(PKS_KEY_PGMAP_PROTECTION)
> > #define PKS_PGMAP_MASK  PKR_KEY_MASK(PKS_KEY_PGMAP_PROTECTION)
> > #else
> > /* Just define another default value to fool the CPP */
> > #define PKS_PGMAP_VALUE PKR_KEY_INIT_RW(0)
> > #define PKS_PGMAP_MASK  PKR_KEY_MASK(0)
> > #endif
> >
> > #define PKS_INIT_VALUE ((0xFFFFFFFF & \
> >                         (PKS_DEFAULT_MASK & \
> >                                 PKS_TEST_MASK & \
> >                                 PKS_PGMAP_MASK \
> >                         )) | \
> >                         (PKS_DEFAULT_VALUE | \
> >                         PKS_TEST_VALUE | \
> >                         PKS_PGMAP_VALUE \
> >                         ) \
> >                         )
> >
> >
> > I find the above much harder to parse and of little value.  I'm pretty sure
> > that someone adding a key is much more likely to get the macro maze wrong.
> > Reviewing a patch to add a key would be much more difficult as well, IMHO.
> >
> > I'm a bit tired of this back and forth trying to implement this for features
> > which may never exist.  In all my discussions I don't think we have reached
> > more than 10 use cases in our wildest dreams and only 4 have even been
> > attempted with real code including the PKS test.
> >
> > So I'm just a bit frustrated with the effort we have put in to try and do
> > dynamic or even compile time dynamic keys.
> >
> > Anyway I'll think on it more.  But I'm inclined to leave it alone right now.
> > What I have is easy to review for correctness and only takes a bit of effort to
> > actually use.
> 
> Sorry for the thrash, Ira. I felt compelled to put some skin in the
> game and came up with the below. Ignore the names; they are just to
> show the idea:
> 
> #define KEYVAL(prev, config) (prev + __is_defined(config))
> #define KEYDEFAULT(key, default, config) ((default << (key * 2)) * __is_defined(config))
> #define KEYX KEYVAL(0, ENABLE_X)
> #define KEYY KEYVAL(KEYX, ENABLE_Y)
> #define KEYZ KEYVAL(KEYY, ENABLE_Z)
> #define KEYMAX KEYVAL(KEYZ, 1)
> 
> #define KEY0_INIT 0x1
> #define KEYX_INIT KEYDEFAULT(KEYX, 2, ENABLE_X)
> #define KEYY_INIT KEYDEFAULT(KEYY, 2, ENABLE_Y)
> #define KEYZ_INIT KEYDEFAULT(KEYZ, 2, ENABLE_Z)
> 
> #define ALL_AD 0x55555555
> #define DEFAULT (ALL_AD & (GENMASK(31, KEYMAX * 2))) | KEYX_INIT | KEYY_INIT | KEYZ_INIT | KEY0_INIT
> 
> The idea is that this relies on a defined key order, but still
> dynamically assigns key values at compile time. It's not as simple as
> an enum, but it still seems readable to me.
> 
> The definition is done in 2 parts: key slot numbers (KEYVAL()) and the
> values they need to inject into the combined global init mask
> (KEYDEFAULT()). The KEYVAL() section above defines the key order where
> each key gets an incremented value in the order, but only if the
> previous key was defined. KEYDEFAULT() defines the init value and
> optionally zeros out the init value if the config is disabled. Finally
> the DEFAULT mask is initialized to all 5s if there are zero keys
> defined, but otherwise masks 2-bits per defined key + 1 and ORs in the
> corresponding key init values. This uses some of the magic behind the
> IS_ENABLED() macro to turn undefined symbols into 0s.
> 
> It seems to work, for example if I just define KEYX and KEYZ,
> 
> #define ENABLE_X 1
> #define ENABLE_Z 1
> 
>         printf("X: %d:%#x Y: %d:%#x Z: %d:%#x MAX: %d DEFAULT: %#x\n", KEYX,
>                KEYX_INIT, KEYY, KEYY_INIT, KEYZ, KEYZ_INIT, KEYMAX, DEFAULT);
> 
> # ./a.out
> X: 1:0x8 Y: 1:0 Z: 2:0x20 MAX: 3 DEFAULT: 0x55555569

Yes, this seems to work.  I still think it is a bit clunky, but after
documenting it I think it is clear enough.  I also played with making it
a bit more straightforward with the macro names.

Here is the doc:

/**
 * DOC: PKS_KEY_ALLOCATION
 *
 * Users reserve a key value in 5 steps.
 *      1) Use PKS_NEW_KEY to create a new key
 *      2) Ensure that the last key value is used in the PKS_NEW_KEY macro
 *      3) Adjust PKS_KEY_MAX to use the newly defined key value
 *      4) Use PKS_DECLARE_INIT_VALUE to define an initial value
 *      5) Add the new PKS default value to PKS_INIT_VALUE
 *
 * The PKS_NEW_KEY and PKS_DECLARE_INIT_VALUE macros require the Kconfig
 * option to be specified to automatically adjust the number of keys used.
 *
 * PKS_KEY_DEFAULT must remain the 0 (prev = 0) key, with a default of RW
 * (declared via PKS_DECLARE_INIT_VALUE), to support non-PKS protected pages.
 *
 * For example to configure a key for 'MY_FEATURE' with a default of Write
 * Disabled.
 *
 * .. code-block:: c
 *
 *      #define PKS_KEY_DEFAULT    PKS_NEW_KEY(0, 1)
 *
 *      // 1) Add PKS_KEY_MY_FEATURE
 *      // 2) Be sure to use the last defined key in the macro
 *      #define PKS_KEY_MY_FEATURE PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_MY_FEATURE)
 *
 *      // 3) Adjust PKS_KEY_MAX
 *      #define PKS_KEY_MAX        PKS_NEW_KEY(PKS_KEY_MY_FEATURE, 1)
 *
 *
 *      #define PKS_KEY_DEFAULT_INIT    PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
 *
 *      // 4) Define initial value
 *      #define PKS_KEY_MY_FEATURE_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_MY_FEATURE, WD, CONFIG_MY_FEATURE)
 *
 *      // 5) Add initial value to PKS_INIT_VALUE
 *      #define PKS_INIT_VALUE ((PKS_ALL_AD & (GENMASK(31, PKS_KEY_MAX * PKR_BITS_PER_PKEY))) | \
 *                              PKS_KEY_DEFAULT_INIT | \
 *                              PKS_KEY_MY_FEATURE_INIT | \
 *                              )
 */


Let me know if this is clear enough?
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-08 22:48           ` Ira Weiny
@ 2022-02-08 23:22             ` Dan Williams
  2022-02-08 23:42               ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-02-08 23:22 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 8, 2022 at 2:48 PM Ira Weiny <ira.weiny@intel.com> wrote:
[..]
>  *      // 5) Add initial value to PKS_INIT_VALUE
>  *      #define PKS_INIT_VALUE ((PKS_ALL_AD & (GENMASK(31, PKS_KEY_MAX * PKR_BITS_PER_PKEY))) | \
>  *                              PKS_KEY_DEFAULT_INIT | \
>  *                              PKS_KEY_MY_FEATURE_INIT | \

Does this compile? I.e. can you have a '|' operator with nothing on
the right hand side?

>  *                              )
>  */
>
>
> Let me know if this is clear enough?

Looks good to me.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM
  2022-02-08 23:22             ` Dan Williams
@ 2022-02-08 23:42               ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-08 23:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Edgecombe, Rick P, hpa, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 08, 2022 at 03:22:22PM -0800, Dan Williams wrote:
> On Tue, Feb 8, 2022 at 2:48 PM Ira Weiny <ira.weiny@intel.com> wrote:
> [..]
> >  *      // 5) Add initial value to PKS_INIT_VALUE
> >  *      #define PKS_INIT_VALUE ((PKS_ALL_AD & (GENMASK(31, PKS_KEY_MAX * PKR_BITS_PER_PKEY))) | \
> >  *                              PKS_KEY_DEFAULT_INIT | \
> >  *                              PKS_KEY_MY_FEATURE_INIT | \
> 
> Does this compile? I.e. can you have a '|' operator with nothing on
> the right hand side?

Oops... yes but only because this is a comment (kernel doc) from the compiler
POV.  Thanks, I'll fix it.

> 
> >  *                              )
> >  */
> >
> >
> > Let me know if this is clear enough?
> 
> Looks good to me.

Thanks,
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-02-04 19:08         ` Ira Weiny
@ 2022-02-09  5:34           ` Ira Weiny
  2022-02-14 19:20             ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-09  5:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Feb 04, 2022 at 11:08:51AM -0800, 'Ira Weiny' wrote:
> On Fri, Jan 28, 2022 at 03:51:56PM -0800, Dave Hansen wrote:
> > On 1/28/22 15:10, Ira Weiny wrote:
> > > The issue is that, because PKS users are kernel-only and are not part of the
> > > architecture-specific code, there need to be 2 mechanisms within the Kconfig
> > > structure: one to communicate that an architecture supports PKS, such that the
> > > user who needs it can depend on that config, and a second to allow that user
> > > to communicate back to the architecture to enable PKS.
> > 
> > I *think* the point here is to ensure that PKS isn't compiled in unless
> > it is supported *AND* needed.
> 
> Yes.
> 
> > You have to have architecture support
> > (ARCH_HAS_SUPERVISOR_PKEYS) to permit features that depend on PKS to be
> > enabled.  Then, once one or more of *THOSE* is enabled,
> > ARCH_ENABLE_SUPERVISOR_PKEYS comes into play and actually compiles the
> > feature in.
> > 
> > In other words, there are two things that must happen before the code
> > gets compiled in:
> > 
> > 1. Arch support
> > 2. One or more features to use the arch support
> 
> Yes.  I really think we are both say the same thing with different words.

Is the following more clear?

<commit>

PKS is only useful to kernel consumers and is only available on some
architectures.  If no kernel consumers are configured or PKS support is
not available the PKS code can be eliminated from the compile.

Define a Kconfig structure which allows kernel consumers to detect
architecture support (ARCH_HAS_SUPERVISOR_PKEYS) and, if available,
indicate that PKS should be compiled in (ARCH_ENABLE_SUPERVISOR_PKEYS).

In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
kernel consumer sets it.

</commit>
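The two-symbol structure described above could be sketched as a Kconfig
fragment.  The consumer option below is a hypothetical example; only the two
ARCH_* symbols come from this series:

```kconfig
# Arch code advertises support (set by x86 when PKS hardware support exists)
config ARCH_HAS_SUPERVISOR_PKEYS
	bool

# Selected by consumers; gates compilation of the core PKS code
config ARCH_ENABLE_SUPERVISOR_PKEYS
	bool

# Hypothetical consumer: only selectable when the arch supports PKS
config MY_FEATURE_PROTECTION
	bool "Stray write protection for my feature"
	depends on ARCH_HAS_SUPERVISOR_PKEYS
	select ARCH_ENABLE_SUPERVISOR_PKEYS
```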

Thanks,
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code
  2022-01-31 19:30   ` Edgecombe, Rick P
@ 2022-02-09 23:44     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-09 23:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 11:30:14AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +static void crash_it(void)
> > +{
> > +       struct pks_test_ctx *ctx;
> > +       void *ptr;
> > +
> > +       pr_warn("     ***** BEGIN: Unhandled fault test *****\n");
> > +
> > +       ctx = alloc_ctx(PKS_KEY_TEST);
> > +       if (IS_ERR(ctx)) {
> > +               pr_err("Failed to allocate context???\n");
> > +               return;
> > +       }
> > +
> > +       ptr = alloc_test_page(ctx->pkey);
> > +       if (!ptr) {
> > +               pr_err("Failed to vmalloc page???\n");
> > +               return;
> > +       }
> > +
> > +       /* This purposely faults */
> > +       memcpy(ptr, ctx->data, 8);
> > +
> > +       /* Should never get here if so the test failed */
> > +       last_test_pass = false;
> > +
> > +       vfree(ptr);
> > +       free_ctx(ctx);
> 
> So these only gets cleaned up if the test fails? Could you clean them
> up in pks_release_file() like the later test patch?

Not a bad idea.  Although if someone is running this they are most likely not
concerned with it.

> 
> > +}
> 
> snip
> 
> > +
> > +static void __exit pks_test_exit(void)
> > +{
> > +       debugfs_remove(pks_test_dentry);
> > +       pr_info("test exit\n");
> > +}
> 
> How does this get called?

Left over from when this was a module.  Thanks for catching.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch
  2022-01-29  0:22   ` Dave Hansen
@ 2022-02-11  6:10     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-11  6:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Jan 28, 2022 at 04:22:42PM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The PKS MSR (PKRS) is defined as a per-logical-processor register.  This
> 
> s/defined as//

Done.

> 
> > isolates memory access by logical CPU.  
> 
> This second sentence is a bit confusing to me.  I *think* you're trying
> to say that PKRS only affects accesses from one logical CPU.

Yes.

> But, it
> just comes out strangely.  I think I'd just axe the sentence.

Yea done.

> 
> > Unfortunately, the MSR is not
> > managed by XSAVE.  Therefore, tasks must save/restore the MSR value on
> > context switch.
> > 
> > Define pks_saved_pkrs in struct thread_struct.  Initialize all tasks,
> > including the init_task, with the PKS_INIT_VALUE when created.  Restore
> > the CPU's MSR to the saved task value on schedule in.
> > 
> > pks_write_current() is added to ensures non-supervisor pkey
> 
> 				  ^ ensure

Done.

> 
> ...
> > diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> > index 2c5f12ae7d04..3530a0e50b4f 100644
> > --- a/arch/x86/include/asm/processor.h
> > +++ b/arch/x86/include/asm/processor.h
> > @@ -2,6 +2,8 @@
> >  #ifndef _ASM_X86_PROCESSOR_H
> >  #define _ASM_X86_PROCESSOR_H
> >  
> > +#include <linux/pks-keys.h>
> > +
> >  #include <asm/processor-flags.h>
> >  
> >  /* Forward declaration, a strange C thing */
> > @@ -502,6 +504,12 @@ struct thread_struct {
> >  	unsigned long		cr2;
> >  	unsigned long		trap_nr;
> >  	unsigned long		error_code;
> > +
> > +#ifdef	CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* Saved Protection key register for supervisor mappings */
> > +	u32			pks_saved_pkrs;
> > +#endif
> 
> There are a bunch of other "saved" registers in thread_struct.  They all
> just have their register name, including pkru.
> 
> Can you just stick this next to 'pkru' and call it plain old 'pkrs'?

Sure.  I was trying to use the same 'pks_*' prefix everywhere.  But pkrs makes
sense too.

> That will probably even take up less space than this since the two
> 32-bit values can be packed together.

Yes done.

[]

> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 3402edec236c..81fc0b638308 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -59,6 +59,7 @@
> >  /* Not included via unistd.h */
> >  #include <asm/unistd_32_ia32.h>
> >  #endif
> > +#include <asm/pks.h>
> >  
> >  #include "process.h"
> >  
> > @@ -657,6 +658,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> >  	/* Load the Intel cache allocation PQR MSR. */
> >  	resctrl_sched_in();
> >  
> > +	pks_write_current();
> > +
> >  	return prev_p;
> >  }
> 
> At least for pkru and fsgsbase, these have the form:
> 
> 	x86_<register>_load();
> 
> Should this be
> 
> 	x86_pkrs_load();

Ok done.

> 
> and be located next to:
> 
> 	x86_pkru_load()?

This presents a problem.  As defined this can't happen until current is loaded.
For now I've passed in the next thread_struct but I fear that is going to cause
some bad header dependencies.  I'll see what 0day has to say about it and
adjust as needed.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 18/44] x86/fault: Add a PKS test fault hook
  2022-01-31 19:56   ` Edgecombe, Rick P
@ 2022-02-11 20:40     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-11 20:40 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 11:56:57AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +                * If a protection key exception occurs it could be
> > because a PKS test
> > +                * is running.  If so, pks_test_callback() will clear
> > the protection
> > +                * mechanism and return true to indicate the fault
> > was handled.
> > +                */
> > +               if (pks_test_callback())
> > +                       return;
> 
> Why do we need both this and pks_handle_key_fault()?

I debated this.  And I convinced myself that it was worth the extra code.

For this series, when testing pks_handle_key_fault() this may get called if
something goes wrong.  And when the test code is not configured it is a no-op.
So I don't see any harm in keeping this as a general handler.

I mentioned this when adding pks_handle_key_fault().[1]  I could make a note of
it in this patch if that would help.

Ira

[1] https://lore.kernel.org/lkml/20220127175505.851391-30-ira.weiny@intel.com/

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback
  2022-02-01 17:42   ` Edgecombe, Rick P
@ 2022-02-11 20:44     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-11 20:44 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 01, 2022 at 09:42:32AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +#define RUN_FAULT_ABANDON      5
> 
> The tests still call this operation "abandon" all throughout, but the
> operation got renamed in the kernel. Probably should rename it in the
> tests too.

Thanks...  I thought I had changed all the names.  Missed that one.

s/RUN_FAULT_ABANDON/RUN_FAULT_CALLBACK

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-02-09  5:34           ` Ira Weiny
@ 2022-02-14 19:20             ` Dave Hansen
  2022-02-14 23:03               ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-14 19:20 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 2/8/22 21:34, Ira Weiny wrote:
>>> In other words, there are two things that must happen before the code
>>> gets compiled in:
>>>
>>> 1. Arch support
>>> 2. One or more features to use the arch support
>> Yes.  I really think we are both say the same thing with different words.
> Is the following more clear?
> 
> <commit>
> 
> PKS is only useful to kernel consumers and is only available on some
> architectures.  If no kernel consumers are configured or PKS support is
> not available the PKS code can be eliminated from the compile.
> 
> Define a Kconfig structure which allows kernel consumers to detect
> architecture support (ARCH_HAS_SUPERVISOR_PKEYS) and, if available,
> indicate that PKS should be compiled in (ARCH_ENABLE_SUPERVISOR_PKEYS).
> 
> In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
> kernel consumer sets it.

It's a bit more clear.  I wish it was more clear about the problem.  I
think it would be well-served to add some specifics and clarify the
*problem*.  Maybe something like:

== Problem ==

PKS support is provided by core x86 architecture code.  Its consumers,
however, may be far-flung device drivers like NVDIMM support.  The PKS
core architecture code is dead weight without a consumer.

--- maybe add one example ---

== Solution ==

Avoid even compiling in the core PKS code if there are no consumers.

== Details ==

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS
  2022-02-14 19:20             ` Dave Hansen
@ 2022-02-14 23:03               ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-14 23:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Mon, Feb 14, 2022 at 11:20:20AM -0800, Dave Hansen wrote:
> On 2/8/22 21:34, Ira Weiny wrote:
> >>> In other words, there are two things that must happen before the code
> >>> gets compiled in:
> >>>
> >>> 1. Arch support
> >>> 2. One or more features to use the arch support
> >> Yes.  I really think we are both say the same thing with different words.
> > Is the following more clear?
> > 
> > <commit>
> > 
> > PKS is only useful to kernel consumers and is only available on some
> > architectures.  If no kernel consumers are configured or PKS support is
> > not available the PKS code can be eliminated from the compile.
> > 
> > Define a Kconfig structure which allows kernel consumers to detect
> > architecture support (ARCH_HAS_SUPERVISOR_PKEYS) and, if available,
> > indicate that PKS should be compiled in (ARCH_ENABLE_SUPERVISOR_PKEYS).
> > 
> > In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
> > kernel consumer sets it.
> 
> It's a bit more clear.  I wish it was more clear about the problem.  I
> think it would be well-served to add some specifics and clarify the
> *problem*.  Maybe something like:

;-(  Ok.

> 
> == Problem ==
> 
> PKS support is provided by core x86 architecture code.  Its consumers,
> however, may be far-flung device drivers like NVDIMM support.  The PKS
> core architecture code is dead weight without a consumer.

This is not the whole story though.  The far-flung device drivers don't need to
compile their PKS code, or may not be able to support a feature, if
ARCH_HAS_SUPERVISOR_PKEYS is not set.  Also, they may wish to choose a
different implementation for the same functionality if available.  So
ARCH_HAS_SUPERVISOR_PKEYS can affect more than just whether PKS code is
compiled.  It could affect how drivers choose to implement some higher level
feature.

> 
> --- maybe add one example ---
> 
> == Solution ==
> 
> Avoid even compiling in the core PKS code if there are no consumers.

And allow users to avoid compiling PKS code on architectures which
don't support PKS.  Or choose another implementation if possible.

I'll try again.

<msg>

Consumers wishing to implement additional protections on memory pages may be
able to use PKS.  However, PKS is only available on some architectures.

In addition, PKS code, both in the core and in these consumers, would be dead
code without PKS being both available and used.  Therefore, if no kernel
consumers are configured or PKS support is not available, all the PKS code can
be eliminated from the compile.

Avoid using PKS if the architecture does not support it.  Furthermore,
avoid compiling any PKS code if there are no consumers configured to use
it.

Define ARCH_HAS_SUPERVISOR_PKEYS to detect architecture support and
define ARCH_ENABLE_SUPERVISOR_PKEYS to indicate the core should compile
in support.

In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
kernel consumer sets it.

</msg>


Ira

> 
> == Details ==

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite()
  2022-01-31 23:10   ` Edgecombe, Rick P
@ 2022-02-18  2:22     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  2:22 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 03:10:39PM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +void pks_update_protection(int pkey, u32 protection)
> > +{
> 
> I don't know if this matters too much, but the type of a pkey is either
> int or u16 across this series and PKU. But it's only possibly a 4 bit
> value.

I was settling on 'int' because the PKRU code uses int a lot.

That said, PKRU is a bit more complicated; x86 is 4 bits, powerpc is 5 bits,
and I see 4 different types for pkey [int, u16, u32, s16].

The signed values are used to mean 'key or error' in a couple of places.  Which
leaves 'int' as a convenient choice over 's16' IMO.  The use of u32 and u16
seems arbitrary.  Both should be plenty big for generic core code.

> Seems the smallest that would fit is char. Why use one over the
> other?
> 
> Also, why u32 for protection here? The whole pkrs value containing the
> bits for all keys is 32 bits, but per key there is only room ever for 2
> bits, right?

Correct, but I'm not sure anything would be saved by declaring it u8.
Regardless, I've changed it.

> 
> It would be nice to be consistent and have a reason, but again, I don't
> know if makes any real difference.

I was consistent in the core code with 'int'.  I'll look at cleaning up some of
the PKRU code but I think that is a separate series from this one.

For this series I'll standardize on u8 because u16 is also too big.  I have
seen one place where it would be nice to have an unsigned type to check the
bounds of the pkey.  So you have a valid point that following the PKRU code was
less than ideal.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite()
  2022-02-01 17:40   ` Dave Hansen
@ 2022-02-18  4:39     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  4:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Tue, Feb 01, 2022 at 09:40:06AM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> > +static inline void pks_mk_readwrite(int pkey)
> > +{
> > +	pks_update_protection(pkey, PKEY_READ_WRITE);
> > +}
> 
> I don't really like the "mk" terminology in here.  Maybe it's from
> dealing with the PTE helpers, but "mk" to me means that it won't do
> anything observable by itself.  We're also not starved for space here,
> and it's really odd to abbreviate "make->mk" but not do "readwrite->rw".
> 
> This really is going off and changing a register value.  I think:
> 
> 	pks_set_readwrite()

Ok  For completeness I'm changing the pgmap_mk_* calls to match; pgmap_set_*.

> 
> would be fine.  This starts to get a bit redundant, but looks fine too:
> 
> 	pks_set_key_readwrite()

Yes, I think that is a bit too verbose.

I think pks_set_xxx() reads nicely.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests
  2022-02-01 17:45   ` Dave Hansen
@ 2022-02-18  5:34     ` Ira Weiny
  2022-02-18 15:28       ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  5:34 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Tue, Feb 01, 2022 at 09:45:03AM -0800, Dave Hansen wrote:
> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> >  bool pks_test_callback(void)
> >  {
> > -	return false;
> > +	bool armed = (test_armed_key != 0);
> > +
> > +	if (armed) {
> > +		pks_mk_readwrite(test_armed_key);
> > +		fault_cnt++;
> > +	}
> > +
> > +	return armed;
> > +}
> 
> Where's the locking for all this?  I don't think we need anything fancy,
> but is there anything preventing the test from being started from
> multiple threads at the same time?  I think a simple global test mutex
> would probably suffice.

Good idea.  Generally I don't see that happening but it is good to be safe.

> 
> Also, pks_test_callback() needs at least a comment or two about what
> it's doing.

The previous patch which adds this call in the fault handler contains the
following comment which is in the final code:

/*
 * pks_test_callback() is called by the fault handler to indicate it saw a pkey
 * fault.
 *
 * NOTE: The callback is responsible for clearing any condition which would
 * cause the fault to re-trigger.
 */

Would you like more comments within the function?

> 
> Does this work if you have a test armed and then you get an unrelated
> PKS fault on another CPU?  I think this will disarm the test from the
> unrelated thread.

This code will detect a false fault.  But the other unrelated fault will work
correctly.

I've debated if the test code should use a specific fault callback...  :-/
That breaks my test which iterates all keys...  but would fix this problem.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-02-01 18:13   ` Edgecombe, Rick P
@ 2022-02-18  6:01     ` Ira Weiny
  2022-02-18 17:28       ` Edgecombe, Rick P
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  6:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 01, 2022 at 10:13:40AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +       if (error_code & X86_PF_PK)
> > +               pks_dump_fault_info(regs);
> > +
> 
> If the kernel makes an errant access to a userspace address with PKU
> enabled and the userspace page marked AD, it should oops and get here,
> but will the X86_PF_PK bit be set even if smap is the real cause? Per
> the SDM, it sounds like it would:
> "
> For accesses to user-mode addresses, the flag is set if
> (1) CR4.PKE = 1;
> (2) the linear address has protection key i; and
> (3) the PKRU register (see Section 4.6.2) is such that either
>         (a) ADi = 1; or
>         (b) the following all hold:
>                 (i) WDi = 1;
>                 (ii) the access is a write access; and
>                 (iii) either CR0.WP = 1 or the access causing the
>                       page-fault exception was a user-mode access.
> "
> 
> ...and then this somewhat confusingly dumps the pks register. Is that
> the real behavior?

Are you suggesting the PKRU should be printed instead or in addition to the
PKS?

AFAICS this really should not present a problem even if the fault is due to a
user pkey violation.  It is simply extra information.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-02-01 17:47   ` Edgecombe, Rick P
  2022-02-01 19:52     ` Edgecombe, Rick P
@ 2022-02-18  6:02     ` Ira Weiny
  1 sibling, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  6:02 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 01, 2022 at 09:47:14AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> >  lib/pks/pks_test.c                     |  74 +++++++++++
> 
> Since this only tests a specific operation of pks, should it be named
> more specifically? Or it might be handy if it ran all the PKS tests,
> even though the others can be run directly.

I've been thinking the same thing too.  I have just not gotten around to it
yet.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-02-01 19:52     ` Edgecombe, Rick P
@ 2022-02-18  6:03       ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18  6:03 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 01, 2022 at 11:52:44AM -0800, Edgecombe, Rick P wrote:
> On Tue, 2022-02-01 at 09:47 -0800, Edgecombe, Richard P wrote:
> > On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > >  lib/pks/pks_test.c                     |  74 +++++++++++
> >
> > Since this only tests a specific operation of pks, should it be named
> > more specifically? Or it might be handy if it ran all the PKS tests,
> > even though the others can be run directly.
> 
> Oops, I meant "tools/testing/selftests/x86/test_pks.c"

Yea...  me too!  ;-)
Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests
  2022-02-18  5:34     ` Ira Weiny
@ 2022-02-18 15:28       ` Dave Hansen
  2022-02-18 17:25         ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-18 15:28 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On 2/17/22 21:34, Ira Weiny wrote:
> On Tue, Feb 01, 2022 at 09:45:03AM -0800, Dave Hansen wrote:
>> On 1/27/22 09:54, ira.weiny@intel.com wrote:
>>>  bool pks_test_callback(void)
>>>  {
>>> -	return false;
>>> +	bool armed = (test_armed_key != 0);
>>> +
>>> +	if (armed) {
>>> +		pks_mk_readwrite(test_armed_key);
>>> +		fault_cnt++;
>>> +	}
>>> +
>>> +	return armed;
>>> +}
>>
>> Where's the locking for all this?  I don't think we need anything fancy,
>> but is there anything preventing the test from being started from
>> multiple threads at the same time?  I think a simple global test mutex
>> would probably suffice.
> 
> Good idea.  Generally I don't see that happening but it is good to be safe.

I'm not sure what you mean.

In the kernel, we always program as if userspace is out to get us.  If
userspace can possibly do something to confuse the kernel, it will.  It
might be malicious or incompetent, but it will happen.

This isn't really a "good to be safe" kind of thing.  Kernel code must
*be* safe.

>> Also, pks_test_callback() needs at least a comment or two about what
>> it's doing.
> 
> The previous patch which adds this call in the fault handler contains the
> following comment which is in the final code:
> 
> /*
>  * pks_test_callback() is called by the fault handler to indicate it saw a pkey
>  * fault.
>  *
>  * NOTE: The callback is responsible for clearing any condition which would
>  * cause the fault to re-trigger.
>  */
> 
> Would you like more comments within the function?

Ahh, it just wasn't in the context.

Looking at this again, I don't really like the name; "callback" is almost
always a waste of bytes.  Imagine this was named something like:

	pks_test_induced_fault();

... and had a comment like:

/*
 * Ensure that the fault handler does not treat
 * test-induced faults as actual errors.
 */

>> Does this work if you have a test armed and then you get an unrelated
>> PKS fault on another CPU?  I think this will disarm the test from the
>> unrelated thread.
> 
> This code will detect a false fault.  

That's a bug that's going to get fixed, right? ;)


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests
  2022-02-18 15:28       ` Dave Hansen
@ 2022-02-18 17:25         ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18 17:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dave Hansen, H. Peter Anvin, Dan Williams, Fenghua Yu,
	Rick Edgecombe, linux-kernel

On Fri, Feb 18, 2022 at 07:28:04AM -0800, Dave Hansen wrote:
> On 2/17/22 21:34, Ira Weiny wrote:
> > On Tue, Feb 01, 2022 at 09:45:03AM -0800, Dave Hansen wrote:
> >> On 1/27/22 09:54, ira.weiny@intel.com wrote:
> >>>  bool pks_test_callback(void)
> >>>  {
> >>> -	return false;
> >>> +	bool armed = (test_armed_key != 0);
> >>> +
> >>> +	if (armed) {
> >>> +		pks_mk_readwrite(test_armed_key);
> >>> +		fault_cnt++;
> >>> +	}
> >>> +
> >>> +	return armed;
> >>> +}
> >>
> >> Where's the locking for all this?  I don't think we need anything fancy,
> >> but is there anything preventing the test from being started from
> >> multiple threads at the same time?  I think a simple global test mutex
> >> would probably suffice.
> > 
> > Good idea.  Generally I don't see that happening but it is good to be safe.
> 
> I'm not sure what you mean.
> 
> In the kernel, we always program as if userspace is out to get us.  If
> userspace can possibly do something to confuse the kernel, it will.  It
> might be malicious or incompetent, but it will happen.
> 
> This isn't really a "good to be safe" kind of thing.  Kernel code must
> *be* safe.

Yes

> 
> >> Also, pks_test_callback() needs at least a comment or two about what
> >> it's doing.
> > 
> > The previous patch which adds this call in the fault handler contains the
> > following comment which is in the final code:
> > 
> > /*
> >  * pks_test_callback() is called by the fault handler to indicate it saw a pkey
> >  * fault.
> >  *
> >  * NOTE: The callback is responsible for clearing any condition which would
> >  * cause the fault to re-trigger.
> >  */
> > 
> > Would you like more comments within the function?
> 
> Ahh, it just wasn't in the context.
> 
> Looking at this again, I don't really like the name; "callback" is almost
> always a waste of bytes.  Imagine this was named something like:
> 
> 	pks_test_induced_fault();
> 
> ... and had a comment like:
> 
> /*
>  * Ensure that the fault handler does not treat
>  * test-induced faults as actual errors.
>  */

Ok.  At this point this may go away depending on how I resolve the ability to
test all the keys.  pks_test_callback() was critical for that feature without
introducing a bunch of ugly test code in pks-keys.h and pkeys.c.

> 
> >> Does this work if you have a test armed and then you get an unrelated
> >> PKS fault on another CPU?  I think this will disarm the test from the
> >> unrelated thread.
> > 
> > This code will detect a false fault.  
> 
> That's a bug that's going to get fixed, right? ;)

Yep.  Not sure how at the moment.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-02-18  6:01     ` Ira Weiny
@ 2022-02-18 17:28       ` Edgecombe, Rick P
  2022-02-18 20:20         ` Dave Hansen
  0 siblings, 1 reply; 145+ messages in thread
From: Edgecombe, Rick P @ 2022-02-18 17:28 UTC (permalink / raw)
  To: Weiny, Ira; +Cc: hpa, Williams, Dan J, linux-kernel, Yu, Fenghua, dave.hansen

On Thu, 2022-02-17 at 22:01 -0800, Ira Weiny wrote:
> Are you suggesting the PKRU should be printed instead or in addition
> to the
> PKS?

Well I was just thinking that PKRS should only be printed if it's an
access via a supervisor pte.  I guess printing PKRU for user faults
could be more complete.  I'm not sure how PKRU could be useful though; I
can only think of the case where SMAP was disabled and there was an
errant access.

> 
> AFAICS this really should not present a problem even if the fault is
> due to a
> user pkey violation.  It is simply extra information.

Yea there is still enough information to decode the fault, but it could
be misleading at a glance. You're right it's not a big deal, but if it
were me I would fix it.

Just looking more, you would have to look up the PTE like is happening
for PF_INSTR. That code would need to be tweaked a little to also
include X86_PF_PK.

On the other hand, someone once told me to avoid touching the oops code
because if it gets broken it makes debugging very hard. It's better to
be reliable than fancy for that stuff.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-02-18 17:28       ` Edgecombe, Rick P
@ 2022-02-18 20:20         ` Dave Hansen
  2022-02-18 20:54           ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Dave Hansen @ 2022-02-18 20:20 UTC (permalink / raw)
  To: Edgecombe, Rick P, Weiny, Ira
  Cc: hpa, Williams, Dan J, linux-kernel, Yu, Fenghua, dave.hansen

On 2/18/22 09:28, Edgecombe, Rick P wrote:
> On Thu, 2022-02-17 at 22:01 -0800, Ira Weiny wrote:
>> Are you suggesting the PKRU should be printed instead or in addition
>> to the
>> PKS?
> Well I was just thinking that PKRS should only be printed if it's an
> access via a supervisor pte.

That's not *wrong* per se, but it's not what we do for PKU:

        if (cpu_feature_enabled(X86_FEATURE_OSPKE))
                printk("%sPKRU: %08x\n", log_lvl, read_pkru());

If the feature is enabled, we print the register.  We don't try to be
fancy and decide if it's relevant to the oops.  Why don't you just stick
PKRS on the same line as PKRU whenever it's supported?

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 26/44] x86/fault: Print PKS MSR on fault
  2022-02-18 20:20         ` Dave Hansen
@ 2022-02-18 20:54           ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-18 20:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Edgecombe, Rick P, hpa, Williams, Dan J, linux-kernel, Yu,
	Fenghua, dave.hansen

On Fri, Feb 18, 2022 at 12:20:58PM -0800, Dave Hansen wrote:
> On 2/18/22 09:28, Edgecombe, Rick P wrote:
> > On Thu, 2022-02-17 at 22:01 -0800, Ira Weiny wrote:
> >> Are you suggesting the PKRU should be printed instead or in addition
> >> to the
> >> PKS?
> > Well I was just thinking that PKRS should only be printed if it's an
> > access via a supervisor pte.
> 
> That's not *wrong* per se, but it's not what we do for PKU:
> 
>         if (cpu_feature_enabled(X86_FEATURE_OSPKE))
>                 printk("%sPKRU: %08x\n", log_lvl, read_pkru());
> 
> If the feature is enabled, we print the register.  We don't try to be
> fancy and decide if it's relevant to the oops.  Why don't you just stick
> PKRS on the same line as PKRU whenever it's supported?

Ah good point.  I'll do that.

Thanks,
Ira
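[Editorial sketch] Dave's suggestion amounts to extending the existing PKRU line rather than adding conditional logic. A minimal userspace model of the combined format (the register values and the `format_pk_line()` helper are illustrative stand-ins; in the kernel the values would come from `read_pkru()` and the per-CPU PKRS cache):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of the combined oops print: when both features are
 * supported, PKRS is emitted on the same line as PKRU.  The function name
 * and the idea of formatting into a buffer are illustrative only.
 */
static int format_pk_line(char *buf, size_t len, const char *log_lvl,
			  unsigned int pkru, unsigned int pkrs)
{
	return snprintf(buf, len, "%sPKRU: %08x PKRS: %08x\n",
			log_lvl, pkru, pkrs);
}
```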

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching
  2022-02-01 17:43   ` Edgecombe, Rick P
@ 2022-02-22 21:42     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-22 21:42 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Tue, Feb 01, 2022 at 09:43:17AM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > +int check_context_switch(int cpu)
> > +{
> > +       int switch_done[2];
> > +       int setup_done[2];
> > +       cpu_set_t cpuset;
> > +       char result[32];
> > +       int rc = 0;
> > +       pid_t pid;
> > +       int fd;
> > +
> > +       CPU_ZERO(&cpuset);
> > +       CPU_SET(cpu, &cpuset);
> > +       /*
> > +        * Ensure the two processes run on the same CPU so that they
> > go through
> > +        * a context switch.
> > +        */
> > +       sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
> > +
> > +       if (pipe(setup_done)) {
> > +               printf("ERROR: Failed to create pipe\n");
> > +               return -1;
> > +       }
> > +       if (pipe(switch_done)) {
> > +               printf("ERROR: Failed to create pipe\n");
> > +               return -1;
> > +       }
> > +
> > +       pid = fork();
> > +       if (pid == 0) {
> > +               char done = 'y';
> > +
> > +               fd = open(PKS_TEST_FILE, O_RDWR);
> > +               if (fd < 0) {
> > +                       printf("ERROR: cannot open %s\n",
> > PKS_TEST_FILE);
> > +                       return -1;
> 
> When this happens, the error is printed, but the parent process just
> hangs forever. Might make it hard to script running all the selftests.

Good point.  I've fixed this up.
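[Editorial sketch] One way such a fix could look, as a simplified userspace model rather than the selftest's actual code (the file path, pipe protocol, and `run_child_parent()` helper are stand-ins): the child always writes a status byte to the pipe, even on `open()` failure, so the parent's blocking `read()` returns instead of hanging.

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Model of the hang fix: on open() failure the child reports 'F' through
 * the pipe rather than exiting silently, so the parent never blocks
 * forever.  All names here are illustrative.
 */
static int run_child_parent(const char *test_file)
{
	int setup_done[2];
	pid_t pid;

	if (pipe(setup_done))
		return -1;

	pid = fork();
	if (pid == 0) {
		char done = 'y';
		int fd = open(test_file, O_RDWR);

		if (fd < 0)
			done = 'F';	/* report failure instead of hanging the parent */
		(void)write(setup_done[1], &done, sizeof(done));
		_exit(fd < 0 ? 1 : 0);
	} else {
		char done = 0;

		/* The child always writes one status byte, so this returns */
		(void)read(setup_done[0], &done, sizeof(done));
		waitpid(pid, NULL, 0);
		return done == 'y' ? 0 : -1;
	}
}
```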

> 
> Also, the other x86 selftests mostly use [RUN], [INFO], [OK], [FAIL],
> [SKIP] in their print statements. Probably should stick to the
> pattern across all the print statements. This is probably a "[SKIP]".
> Just realized I've omitted the "[]" in the CET series too.

Thanks, fixed.

Ira

> 
> > +               }
> > +
> > +               cpu = sched_getcpu();
> > +               printf("Child running on cpu %d...\n", cpu);
> > +
> > +               /* Allocate and run test. */
> > +               write(fd, RUN_SINGLE, 1);
> > +
> > +               /* Arm for context switch test */
> > +               write(fd, ARM_CTX_SWITCH, 1);
> > +
> > +               printf("   tell parent to go\n");
> > +               write(setup_done[1], &done, sizeof(done));
> > +
> > +               /* Context switch out... */
> > +               printf("   Waiting for parent...\n");
> > +               read(switch_done[0], &done, sizeof(done));
> > +
> > +               /* Check msr restored */
> > +               printf("Checking result\n");
> > +               write(fd, CHECK_CTX_SWITCH, 1);
> > +
> > +               read(fd, result, 10);
> > +               printf("   #PF, context switch, pkey allocation and
> > free tests: %s\n", result);
> > +               if (!strncmp(result, "PASS", 10)) {
> > +                       rc = -1;
> > +                       done = 'F';
> > +               }
> > +
> > +               /* Signal result */
> > +               write(setup_done[1], &done, sizeof(done));
> > +       } else {
> > +               char done = 'y';
> > +
> > +               read(setup_done[0], &done, sizeof(done));
> > +               cpu = sched_getcpu();
> > +               printf("Parent running on cpu %d\n", cpu);
> > +
> > +               fd = open(PKS_TEST_FILE, O_RDWR);
> > +               if (fd < 0) {
> > +                       printf("ERROR: cannot open %s\n",
> > PKS_TEST_FILE);
> > +                       return -1;
> > +               }
> > +
> > +               /* run test with the same pkey */
> > +               write(fd, RUN_SINGLE, 1);
> > +
> > +               printf("   Signaling child.\n");
> > +               write(switch_done[1], &done, sizeof(done));
> > +
> > +               /* Wait for result */
> > +               read(setup_done[0], &done, sizeof(done));
> > +               if (done == 'F')
> > +                       rc = -1;
> > +       }
> 
> 

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls
  2022-02-04 18:35   ` Dan Williams
  2022-02-05  0:09     ` Ira Weiny
@ 2022-02-22 22:05     ` Ira Weiny
  1 sibling, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-02-22 22:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 10:35:59AM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Users will need a way to flag valid access to pages which have been
> > protected with PGMAP protections.  Provide this by defining pgmap_mk_*()
> > accessor functions.
> 
> I find the ambiguous use of "Users" not helpful to set the context. How about:
> 
> A thread that wants to access memory protected by PGMAP protections
> must first enable access, and then disable access when it is done.
> 
> >
> > pgmap_mk_{readwrite|noaccess}() take a struct page for convenience.
> > They determine if the page is protected by dev_pagemap protections.  If
> > so, they perform the requested operation.
> >
> > In addition, the lower level __pgmap_* functions are exported.  They
> > take the dev_pagemap object directly for internal users who have
> > knowledge of the dev_pagemap.
> >
> > All changes in the protections must be through the above calls.  They
> > abstract the protection implementation (currently the PKS api) from the
> > upper layer users.
> >
> > Furthermore, the calls are nestable by the use of a per task reference
> > count.  This ensures that the first call to re-enable protection does
> > not 'break' the last access of the device memory.
> >
> > Access to device memory during exceptions (#PF) is expected only from
> > user faults.  Therefore there is no need to maintain the reference count
> > when entering or exiting exceptions.  However, reference counting will
> > occur during the exception.  Recall that protection is automatically
> > enabled during exceptions by the PKS core.[1]
> >
> > NOTE: It is not anticipated that any code paths will directly nest these
> > calls.  For this reason multiple reviewers, including Dan and Thomas,
> > asked why this reference counting was needed at this level rather than
> > in a higher level call such as kmap_{atomic,local_page}().  The reason
> > is that pgmap_mk_readwrite() could nest with regards to other callers of
> > pgmap_mk_*() such as kmap_{atomic,local_page}().  Therefore push this
> > reference counting to the lower level and just ensure that these calls
> > are nestable.
> 
> I still don't think that explains why task struct has a role to play
> here, see below.
> 
> Another missing bit of clarification, maybe I missed it, is why are
> the protections toggled between read-write and noaccess. For
> stray-write protection toggling between read-write and read-only is
> sufficient. I can imagine speculative execution and debug rationales
> for noaccess, but those should be called out explicitly.
> 

I'll clarify in the commit message, but it is simply providing consistent
behavior for kmap'ing a page before and after this series.  kmap allows for
both read and write access.

I know it was discussed to introduce the complexity of different mappings for
read vs write.  But I think that is something which could be added later rather
than being a requirement of this series.
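[Editorial sketch] The nesting behavior the quoted commit message describes can be modeled with a per-task counter: only the 0→1 transition grants access and only the final 1→0 transition revokes it, so an inner disable cannot "break" an outer caller's access. This is an illustrative userspace model, not the kernel implementation; the `pgmap_model_*()` names and globals are stand-ins for the real per-task count and PKS register updates.

```c
#include <stdbool.h>

static int pgmap_ref;		/* models the per-task reference count */
static bool access_enabled;	/* models the PKS permission state */

static void pgmap_model_readwrite(void)
{
	if (pgmap_ref++ == 0)	/* only the outermost call flips the state */
		access_enabled = true;
}

static void pgmap_model_noaccess(void)
{
	if (--pgmap_ref == 0)	/* only the last caller revokes access */
		access_enabled = false;
}

static bool pgmap_model_enabled(void)
{
	return access_enabled;
}
```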

[snip]

> 
> The naming, which I had a hand in, is not aging well. When I see "mk"
> I expect it to be building some value like a page table entry that
> will be installed later. These helpers are directly enabling and
> disabling access and are meant to be called symmetrically. So I would
> expect symmetric names like:
> 
> pgmap_enable_access()
> pgmap_disable_access()

For this Dave requested s/pks_mk_*/pks_set_*/.  So I've followed that
convention here.  New names are pgmap_set_*().  Although I'm not sure I'm happy
with that name now...

Enable may sound better but we had used 'enable_access' before and it got all
confusing for some reason...  :-/

pgmap_set_noaccess()
pgmap_set_readwrite()

Seems good I think.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available()
  2022-02-04 16:19   ` Dan Williams
@ 2022-02-28 16:59     ` Ira Weiny
  2022-03-01 15:56       ` Ira Weiny
  0 siblings, 1 reply; 145+ messages in thread
From: Ira Weiny @ 2022-02-28 16:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 08:19:43AM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:

[snip]

> > @@ -63,6 +64,16 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
> >  }
> >  #endif /* CONFIG_DEV_PAGEMAP_OPS */
> >
> > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > +
> > +bool pgmap_protection_available(void)
> > +{
> > +       return pks_available();
> > +}
> > +EXPORT_SYMBOL_GPL(pgmap_protection_available);
> 
> Any reason this was chosen to be an out-of-line function? Doesn't this
> defeat the performance advantages of static_cpu_has()?

Unfortunately, yes.  pkeys.h includes mm.h which means I can't include pkeys.h
here in mm.h.

Let me see what I can do.  In patch 11 I created pks-keys.h.  Let me see if I
can leverage that header instead of pkeys.h.

When I created that header I was thinking that the user and supervisor pkey
functions may need even more separation in the headers but I was fearful of
putting too much in pks-keys.h because it was created to avoid conflicts in
asm/processor.h.  Looking at it again I think pks_available() may be ok in
pks-keys.h.

Ira

> 
> > +
> > +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
> > +
> >  static void pgmap_array_delete(struct range *range)
> >  {
> >         xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback
  2022-02-01  0:55   ` Edgecombe, Rick P
@ 2022-03-01 15:39     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 15:39 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 04:55:47PM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:54 -0800, ira.weiny@intel.com wrote:
> > Add a test which does this.
> >
> >         $ echo 5 > /sys/kernel/debug/x86/run_pks
> >         $ cat /sys/kernel/debug/x86/run_pks
> >         PASS
> 
> Hmm, when I run this on qemu TCG, I get:
> 
> root@(none):/# echo 5 > /sys/kernel/debug/x86/run_pks
> [   29.438159] pks_test: Failed to see the callback
> root@(none):/# cat /sys/kernel/debug/x86/run_pks
> FAIL
> 
> I think it's a problem with the test though. The generated code is not
> expecting fault_callback_ctx.callback_seen to get changed in the
> exception. The following fixed it for me:
> 
> diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
> index 1528df0bb283..d979d2afe921 100644
> --- a/lib/pks/pks_test.c
> +++ b/lib/pks/pks_test.c
> @@ -570,6 +570,7 @@ static bool run_fault_clear_test(void)
>         /* fault */
>         memcpy(test_page, ctx->data, 8);
> 
> +       barrier();
>         if (!fault_callback_ctx.callback_seen) {
>                 pr_err("Failed to see the callback\n");
>                 rc = false;
> 
> But, I wonder if volatile is also needed on the read to be fully
> correct. I usually have to consult the docs when I deal with that
> stuff...

I was not able to reproduce this.  However, I've done a lot of reading and I
think you are correct that the barrier is needed.  I thought WRITE_ONCE was
sufficient and I had used it in other calls but I missed it here.

As part of the test rework I've added a call to barrier() for all the tests.
In addition I've simplified, and hopefully clarified, which variables are
being shared with the fault handler.

Thanks for the testing and review!
Ira
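[Editorial sketch] The reordering hazard discussed above can be modeled in userspace: the flag is written "behind the compiler's back" (in the kernel, inside the #PF callback), so the test needs `barrier()`/`WRITE_ONCE()`-style accessors to prevent the compiler from caching a stale value across the faulting access. The macro definitions below mirror common kernel idioms but are simplified stand-ins, and the function names are illustrative.

```c
/* Minimal userspace stand-ins for the kernel's barrier()/WRITE_ONCE() */
#define barrier()	 __asm__ volatile("" ::: "memory")
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
#define READ_ONCE(x)	 (*(volatile __typeof__(x) *)&(x))

static int callback_seen;

/* In the kernel this assignment happens inside the #PF fault callback */
static void fault_callback(void)
{
	WRITE_ONCE(callback_seen, 1);
}

static int check_callback(void)
{
	fault_callback();	/* stands in for the faulting memcpy() */
	barrier();		/* forbid caching callback_seen across the fault */
	return READ_ONCE(callback_seen);
}
```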

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available()
  2022-02-28 16:59     ` Ira Weiny
@ 2022-03-01 15:56       ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 15:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Mon, Feb 28, 2022 at 08:59:44AM -0800, Ira Weiny wrote:
> On Fri, Feb 04, 2022 at 08:19:43AM -0800, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> 
> [snip]
> 
> > > @@ -63,6 +64,16 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
> > >  }
> > >  #endif /* CONFIG_DEV_PAGEMAP_OPS */
> > >
> > > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > > +
> > > +bool pgmap_protection_available(void)
> > > +{
> > > +       return pks_available();
> > > +}
> > > +EXPORT_SYMBOL_GPL(pgmap_protection_available);
> > 
> > Any reason this was chosen to be an out-of-line function? Doesn't this
> > defeat the performance advantages of static_cpu_has()?
> 
> Unfortunately, yes.  pkeys.h includes mm.h which means I can't include pkeys.h
> here in mm.h.
> 
> Let me see what I can do.  In patch 11 I created pks-keys.h.  Let me see if I
> can leverage that header instead of pkeys.h.
> 
> When I created that header I was thinking that the user and supervisor pkey
> functions may need even more separation in the headers but I was fearful of
> putting too much in pks-keys.h because it was created to avoid conflicts in
> asm/processor.h.  Looking at it again I think pks_available() may be ok in
> pks-keys.h.

Ok I've reworked the series to allow for this.  However, pks-keys.h was not
sufficient.  That header needs to be specific to the definition of the keys
themselves (hence the name).

In order to facilitate this change I've introduced another header linux/pks.h
which separates out the supervisor specific calls from the user pkeys calls.
It worked out well and I think makes a lot of sense due to the different
functionality.  But I'm pretty bad at names so I'm open to changing the name of
the header if 'pks.h' seems too generic.

Ira

> 
> Ira
> 
> > 
> > > +
> > > +#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
> > > +
> > >  static void pgmap_array_delete(struct range *range)
> > >  {
> > >         xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
> > > --
> > > 2.31.1
> > >

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 42/44] dax: Stray access protection for dax_direct_access()
  2022-02-04  5:19   ` Dan Williams
@ 2022-03-01 18:13     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 18:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Thu, Feb 03, 2022 at 09:19:58PM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > dax_direct_access() provides a way to obtain the direct map address of
> > PMEM memory.  Coordinate PKS protection with dax_direct_access() of
> > protected devmap pages.
> >
> > Introduce 3 new dax_operation calls .map_protected .mk_readwrite and
> > .mk_noaccess. These 3 calls do not have to be implemented by the dax
> > provider if no protection is implemented.
> >
> > Threads of execution can use dax_mk_{readwrite,noaccess}() to relax the
> > protection of the dax device and allow direct use of the kaddr returned
> > from dax_direct_access().  The dax_mk_{readwrite,noaccess}() calls only
> > need to be used to guard actual access to the memory.  Other uses of
> > dax_direct_access() do not need to use these guards.
> >
> > For users who require a permanent address to the dax device, such as the
> > DM write cache, dax_map_protected() indicates that the dax device has
> > additional protections and that the user should create its own permanent
> > mapping of the memory.  Update the DM write cache code to create this
> > permanent mapping.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> [..]
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 9fc5f99a0ae2..261af298f89f 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -30,6 +30,10 @@ struct dax_operations {
> >                         sector_t, sector_t);
> >         /* zero_page_range: required operation. Zero page range   */
> >         int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
> > +
> > +       bool (*map_protected)(struct dax_device *dax_dev);
> > +       void (*mk_readwrite)(struct dax_device *dax_dev);
> > +       void (*mk_noaccess)(struct dax_device *dax_dev);
> 
> So the dax code just recently jettisoned the ->copy_{to,from}_iter()
> ops and it would be a shame to grow more ops. Given that the
> implementation is pgmap generic I think all that is needed is way to
> go from a daxdev to a pgmap and then use the pgmap helpers directly
> rather than indirecting through the pmem driver just to get the pgmap.

Ok done.

dax_device now has knowledge of the pgmap which was pretty clean.

Ira
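[Editorial sketch] What "dax_device now has knowledge of the pgmap" could look like, shown as a simplified model (the struct shapes and the `dax_dev_pgmap()` accessor are hypothetical; the real structs live in include/linux/dax.h and include/linux/memremap.h): the dax core stores a dev_pagemap pointer at registration time, so it can call the pgmap helpers directly instead of indirecting through driver-specific dax_operations.

```c
#include <stddef.h>

/* Hypothetical, simplified struct shapes for illustration only */
struct dev_pagemap {
	unsigned long flags;
};

struct dax_device {
	struct dev_pagemap *pgmap;	/* set once when the device registers */
};

/*
 * The dax core reaches the pgmap directly, removing the need for
 * per-driver map_protected/mk_readwrite/mk_noaccess callbacks.
 */
static struct dev_pagemap *dax_dev_pgmap(struct dax_device *dax_dev)
{
	return dax_dev ? dax_dev->pgmap : NULL;
}
```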

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested
  2022-02-04 17:41   ` Dan Williams
@ 2022-03-01 18:15     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 18:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 09:41:59AM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > When the user requests protections the dev_pagemap mappings need to have
> > a PKEY set.
> >
> > Define devmap_protection_adjust_pgprot() to add the PKey to the page
> > protections.  Call it when PGMAP_PROTECTIONS is requested when remapping
> > pages.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> 
> Does this patch have a reason to exist independent of the patch that
> introduced devmap_protection_enable()?
> 
> Otherwise looks ok.

Just easier to review this specific change.  For V8 I split the patches up
quite a bit to be much more direct to 1 change/patch.  I think it worked out
well and I don't plan to merge much in V9 because as you say this change looks
good.  :-D

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection
  2022-02-04 21:10   ` Dan Williams
@ 2022-03-01 18:18     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 18:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 01:10:53PM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Now that all valid kernel accesses to PMEM have been annotated with
> > {__}pgmap_mk_{readwrite,noaccess}(), PGMAP_PROTECTION is safe to enable
> > in the pmem layer.
> >
> > Implement the pmem_map_protected() and pmem_mk_{readwrite,noaccess}() to
> > communicate this memory has extra protection to the upper layers if
> > PGMAP_PROTECTION is specified.
> >
> > Internally, the pmem driver uses a cached virtual address,
> > pmem->virt_addr (pmem_addr).  Use __pgmap_mk_{readwrite,noaccess}()
> > directly when PGMAP_PROTECTION is active on the device.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes for V8
> >         Rebase to 5.17-rc1
> >         Remove global param
> >         Add internal structure which uses the pmem device and pgmap
> >                 device directly in the *_mk_*() calls.
> >         Add pmem dax ops callbacks
> >         Use pgmap_protection_available()
> >         s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
> > ---
> >  drivers/nvdimm/pmem.c | 52 ++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 51 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 58d95242a836..2afff8157233 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
> >         return BLK_STS_OK;
> >  }
> >
> > +static void __pmem_mk_readwrite(struct pmem_device *pmem)
> > +{
> > +       if (pmem->pgmap.flags & PGMAP_PROTECTION)
> > +               __pgmap_mk_readwrite(&pmem->pgmap);
> > +}
> > +
> > +static void __pmem_mk_noaccess(struct pmem_device *pmem)
> > +{
> > +       if (pmem->pgmap.flags & PGMAP_PROTECTION)
> > +               __pgmap_mk_noaccess(&pmem->pgmap);
> > +}
> > +
> 
> Per previous feedback let's find a way for the pmem driver to stay out
> of the loop, and just let these toggles by pgmap generic operations.

I want to clarify.  Yes the pmem driver is now out of the dax driver loop.
However, these calls must remain because the pmem driver caches pmem->virt_addr
and uses that address to access the maps directly.

Therefore these specific calls need to remain for the pmem drivers internal
use.  In addition to the commit message I've added comments to the call sites
to clarify this fact.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-02-04 21:07   ` Dan Williams
@ 2022-03-01 19:45     ` Ira Weiny
  2022-03-01 19:50       ` Ira Weiny
  2022-03-01 20:05       ` Dan Williams
  0 siblings, 2 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 19:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 01:07:10PM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Users of devmap pages should not have to know that the pages they are
> > operating on are special.
> 
> How about get straight to the point without any ambiguous references:
> 
> Today, kmap_{local_page,atomic} handles granting access to HIGHMEM
> pages without the caller needing to know if the page is HIGHMEM, or
> not. Use that existing infrastructure to grant access to PKS/PGMAP
> access protected pages.

This sounds better.  Thanks.

> 
> > Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
> > pages via the devmap facility.  kmap_{local_page,atomic}() are both
> > thread local mappings so they work well with the thread specific
> > protections available.
> >
> > kmap(), on the other hand, allows for global mappings to be established,
> > which is incompatible with the underlying PKS facility.
> 
> Why is kmap incompatible with PKS? I know why, but this is a claim
> without evidence. If you documented that in a previous patch, there's
> no harm and copying and pasting into this one. A future git log user
> will thank you for not making them go to lore to try to find the one
> patch with the  details.

Good point.

> Extra credit for creating a PKS theory of
> operation document with this detail, unless I missed that?

Well...  I've documented and mentioned the thread-local'ness of PKS a lot but
I'm pretty close to all of this so it is hard for me to remember where and to
what degree that is documented.  I've already reworked the PKS documentation a
bit.  So I'll review that.

> 
> > For this reason
> > kmap() is not supported.  Rather than leave the kmap mappings to fault
> > at random times when users may access them,
> 
> Is that a problem?

No.

> This instrumentation is also insufficient for
> legitimate usages of page_address().

True.  Although with this protection those accesses are no longer legitimate.
And it sounds like it may be worth putting a call in page_address() as well.

> Might as well rely on the kernel
> developer community being able to debug PKS WARN() splats back to the
> source because that will need to be done regardless, given kmap() is
> not the only source of false positive access violations.

I disagree but I'm happy to drop pgmap_protection_flag_invalid() if that is the
consensus.

The reason I disagree is that it is generally better to catch errors early
rather than later.  Furthermore, this does not change the permissions, which
means the actual invalid access will also get flagged at the point of use.
This allows more debugging information for the user.

Do you feel that strongly about removing pgmap_protection_flag_invalid()?

> 
> > call
> > pgmap_protection_flag_invalid() to show kmap() users the call stack of
> > where mapping was created.  This allows better debugging.
> >
> > This behavior is safe because neither of the 2 current DAX-capable
> > filesystems (ext4 and xfs) perform such global mappings.  And known
> > device drivers that would handle devmap pages are not using kmap().  Any
> > future filesystems that gain DAX support, or device drivers wanting to
> > support devmap protected pages will need to use kmap_local_page().
> >
> > Direct-map exposure is already mitigated by default on HIGHMEM systems
> > because by definition HIGHMEM systems do not have large capacities of
> > memory in the direct map.  And using kmap in those systems actually
> > creates a separate mapping.  Therefore, to reduce complexity HIGHMEM
> > systems are not supported.
> 
> It was only at the end of this paragraph did I understand why I was
> reading this paragraph. The change in topic was buried. I.e.
> 
> ---
> 
> Note: HIGHMEM support is mutually exclusive with PGMAP protection. The
> rationale is mainly to reduce complexity, but also because direct-map
> exposure is already mitigated by default on HIGHMEM systems  because
> by definition HIGHMEM systems do not have large capacities of memory
> in the direct map...

Sounds good.  Sorry about not being clear.

> 
> ---
> 
> That note and related change should probably go in the same patch that
> introduces CONFIG_DEVMAP_ACCESS_PROTECTION in the first place. It's an
> unrelated change to instrumenting kmap() to fail early, which again I
> don't think is strictly necessary.

I'm not sure about this.

Unfortunately I have not made the point of this patch clear.  This patch
is co-opting the highmem interface [kmap(), kmap_atomic(), and
kmap_local_page()] to support PKS protected mappings.

The global nature of the kmap() call is not supported and is special cased.
HIGHMEM systems are also not supported and special cased.

I'll try and clarify this in V9.

Ira

> 
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes for V8
> >         Reword commit message
> > ---
> >  include/linux/highmem-internal.h | 5 +++++
> >  mm/Kconfig                       | 1 +
> >  2 files changed, 6 insertions(+)
> >
> > diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h
> > index 0a0b2b09b1b8..1a006558734c 100644
> > --- a/include/linux/highmem-internal.h
> > +++ b/include/linux/highmem-internal.h
> > @@ -159,6 +159,7 @@ static inline struct page *kmap_to_page(void *addr)
> >  static inline void *kmap(struct page *page)
> >  {
> >         might_sleep();
> > +       pgmap_protection_flag_invalid(page);
> >         return page_address(page);
> >  }
> >
> > @@ -174,6 +175,7 @@ static inline void kunmap(struct page *page)
> >
> >  static inline void *kmap_local_page(struct page *page)
> >  {
> > +       pgmap_mk_readwrite(page);
> >         return page_address(page);
> >  }
> >
> > @@ -197,6 +199,7 @@ static inline void __kunmap_local(void *addr)
> >  #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
> >         kunmap_flush_on_unmap(addr);
> >  #endif
> > +       pgmap_mk_noaccess(kmap_to_page(addr));
> >  }
> >
> >  static inline void *kmap_atomic(struct page *page)
> > @@ -206,6 +209,7 @@ static inline void *kmap_atomic(struct page *page)
> >         else
> >                 preempt_disable();
> >         pagefault_disable();
> > +       pgmap_mk_readwrite(page);
> >         return page_address(page);
> >  }
> >
> > @@ -224,6 +228,7 @@ static inline void __kunmap_atomic(void *addr)
> >  #ifdef ARCH_HAS_FLUSH_ON_KUNMAP
> >         kunmap_flush_on_unmap(addr);
> >  #endif
> > +       pgmap_mk_noaccess(kmap_to_page(addr));
> >         pagefault_enable();
> >         if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >                 migrate_enable();
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 67e0264acf7d..d537679448ae 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -779,6 +779,7 @@ config ZONE_DEVICE
> >  config DEVMAP_ACCESS_PROTECTION
> >         bool "Access protection for memremap_pages()"
> >         depends on NVDIMM_PFN
> > +       depends on !HIGHMEM
> >         depends on ARCH_HAS_SUPERVISOR_PKEYS
> >         select ARCH_ENABLE_SUPERVISOR_PKEYS
> >         default y
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-03-01 19:45     ` Ira Weiny
@ 2022-03-01 19:50       ` Ira Weiny
  2022-03-01 20:05       ` Dan Williams
  1 sibling, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 19:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Tue, Mar 01, 2022 at 11:45:41AM -0800, Ira Weiny wrote:
> On Fri, Feb 04, 2022 at 01:07:10PM -0800, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > >
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> 
> > 
> > > Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
> > > pages via the devmap facility.  kmap_{local_page,atomic}() are both
> > > thread local mappings so they work well with the thread specific
> > > protections available.
> > >
> > > kmap(), on the other hand, allows for global mappings to be established,
> > > which is incompatible with the underlying PKS facility.
> > 
> > Why is kmap incompatible with PKS? I know why, but this is a claim
> > without evidence. If you documented that in a previous patch, there's
> > no harm and copying and pasting into this one. A future git log user
> > will thank you for not making them go to lore to try to find the one
> > patch with the  details.
> 
> Good point.
> 

FWIW, I just noticed the previous paragraph mentioned the PKS protections were
thread local.  I'll still reiterate and clarify here.

Ira

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-03-01 19:45     ` Ira Weiny
  2022-03-01 19:50       ` Ira Weiny
@ 2022-03-01 20:05       ` Dan Williams
  2022-03-01 23:03         ` Ira Weiny
  1 sibling, 1 reply; 145+ messages in thread
From: Dan Williams @ 2022-03-01 20:05 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Tue, Mar 1, 2022 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Fri, Feb 04, 2022 at 01:07:10PM -0800, Dan Williams wrote:
> > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > >
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > Users of devmap pages should not have to know that the pages they are
> > > operating on are special.
> >
> > How about get straight to the point without any ambiguous references:
> >
> > Today, kmap_{local_page,atomic} handles granting access to HIGHMEM
> > pages without the caller needing to know if the page is HIGHMEM, or
> > not. Use that existing infrastructure to grant access to PKS/PGMAP
> > access protected pages.
>
> This sounds better.  Thanks.
>
> >
> > > Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
> > > pages via the devmap facility.  kmap_{local_page,atomic}() are both
> > > thread local mappings so they work well with the thread specific
> > > protections available.
> > >
> > > kmap(), on the other hand, allows for global mappings to be established,
> > > Which is incompatible with the underlying PKS facility.
> >
> > Why is kmap incompatible with PKS? I know why, but this is a claim
> > without evidence. If you documented that in a previous patch, there's
> > no harm and copying and pasting into this one. A future git log user
> > will thank you for not making them go to lore to try to find the one
> > patch with the  details.
>
> Good point.
>
> > Extra credit for creating a PKS theory of
> > operation document with this detail, unless I missed that?
>
> Well...  I've documented and mentioned the thread-local'ness of PKS a lot but
> I'm pretty close to all of this so it is hard for me to remember where and to
> what degree that is documented.  I've already reworked the PKS documentation a
> bit.  So I'll review that.
>
> >
> > > For this reason
> > > kmap() is not supported.  Rather than leave the kmap mappings to fault
> > > at random times when users may access them,
> >
> > Is that a problem?
>
> No.

What I meant was how random is random and is it distinguishable from
direct page_address() usage where there is no explicit early failure
path?

>
> > This instrumentation is also insufficient for
> > legitimate usages of page_address().
>
> True.  Although with this protection those access' are no longer legitimate.
> And it sounds like it may be worth putting a call in page_address() as well.
>
> > Might as well rely on the kernel
> > developer community being able to debug PKS WARN() splats back to the
> > source because that will need to be done regardless, given kmap() is
> > not the only source of false positive access violations.
>
> I disagree but I'm happy to drop pgmap_protection_flag_invalid() if that is the
> consensus.
>
> The reason I disagree is that it is generally better to catch errors early
> rather than later.  Furthermore, this does not change the permissions.  Which
> means the actual invalid access will also get flagged at the point of use.
> This allows more debugging information for the user.
>
> Do you feel that strongly about removing pgmap_protection_flag_invalid()?

You haven't convinced me that it matters yet. Do you have an example
of a kmap() pointer dereference PKS splat where it's not clear from
the backtrace from the fault handler that a kmap path was involved?

At a minimum if it stays it seems like something that should be
wrapped by VM_WARN_ON_ONCE_PAGE() like other page relative memory
debugging extra checks that get disabled by CONFIG_DEBUG_VM, but the
assertion that "early is better" needs evidence that "later is too
ambiguous".


* Re: [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages
  2022-03-01 20:05       ` Dan Williams
@ 2022-03-01 23:03         ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-01 23:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Tue, Mar 01, 2022 at 12:05:27PM -0800, Dan Williams wrote:
> On Tue, Mar 1, 2022 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Fri, Feb 04, 2022 at 01:07:10PM -0800, Dan Williams wrote:
> > > On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> > > >
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > >
> > > > Users of devmap pages should not have to know that the pages they are
> > > > operating on are special.
> > >
> > > How about get straight to the point without any ambiguous references:
> > >
> > > Today, kmap_{local_page,atomic} handles granting access to HIGHMEM
> > > pages without the caller needing to know if the page is HIGHMEM, or
> > > not. Use that existing infrastructure to grant access to PKS/PGMAP
> > > access protected pages.
> >
> > This sounds better.  Thanks.
> >
> > >
> > > > Co-opt the kmap_{local_page,atomic}() to mediate access to PKS protected
> > > > pages via the devmap facility.  kmap_{local_page,atomic}() are both
> > > > thread local mappings so they work well with the thread specific
> > > > protections available.
> > > >
> > > > kmap(), on the other hand, allows for global mappings to be established,
> > > > Which is incompatible with the underlying PKS facility.
> > >
> > > Why is kmap incompatible with PKS? I know why, but this is a claim
> > > without evidence. If you documented that in a previous patch, there's
> > > no harm and copying and pasting into this one. A future git log user
> > > will thank you for not making them go to lore to try to find the one
> > > patch with the  details.
> >
> > Good point.
> >
> > > Extra credit for creating a PKS theory of
> > > operation document with this detail, unless I missed that?
> >
> > Well...  I've documented and mentioned the thread-local'ness of PKS a lot but
> > I'm pretty close to all of this so it is hard for me to remember where and to
> > what degree that is documented.  I've already reworked the PKS documentation a
> > bit.  So I'll review that.
> >
> > >
> > > > For this reason
> > > > kmap() is not supported.  Rather than leave the kmap mappings to fault
> > > > at random times when users may access them,
> > >
> > > Is that a problem?
> >
> > No.
> 
> What I meant was how random is random and is it distinguishable from
> direct page_address() usage where there is no explicit early failure
> path?

Ok you've convinced me.  I'll drop this.

> 
> >
> > > This instrumentation is also insufficient for
> > > legitimate usages of page_address().
> >
> > True.  Although with this protection those access' are no longer legitimate.
> > And it sounds like it may be worth putting a call in page_address() as well.
> >
> > > Might as well rely on the kernel
> > > developer community being able to debug PKS WARN() splats back to the
> > > source because that will need to be done regardless, given kmap() is
> > > not the only source of false positive access violations.
> >
> > I disagree but I'm happy to drop pgmap_protection_flag_invalid() if that is the
> > consensus.
> >
> > The reason I disagree is that it is generally better to catch errors early
> > rather than later.  Furthermore, this does not change the permissions.  Which
> > means the actual invalid access will also get flagged at the point of use.
> > This allows more debugging information for the user.
> >
> > Do you feel that strongly about removing pgmap_protection_flag_invalid()?
> 
> You haven't convinced me that it matters yet. Do you have an example
> of a kmap() pointer dereference PKS splat where it's not clear from
> the backtrace from the fault handler that a kmap path was involved?
> 
> At a minimum if it stays it seems like something that should be
> wrapped by VM_WARN_ON_ONCE_PAGE() like other page relative memory
> debugging extra checks that get disabled by CONFIG_DEBUG_VM, but the
> assertion that "early is better" needs evidence that "later is too
> ambiguous".

I'll drop this.  It is easier to just leave it out.

Ira


* Re: [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode
  2022-02-01  1:16   ` Edgecombe, Rick P
@ 2022-03-02  0:20     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-02  0:20 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 05:16:26PM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:55 -0800, ira.weiny@intel.com wrote:
> > +static int param_get_pks_fault_mode(char *buffer, const struct
> > kernel_param *kp)
> > +{
> > +       int ret = 0;
> This doesn't need to be initialized.

Thanks, fixed,
Ira

> 
> > +
> > +       switch (pks_fault_mode) {
> > +       case PKS_MODE_STRICT:
> > +               ret = sysfs_emit(buffer, "strict\n");
> > +               break;
> > +       case PKS_MODE_RELAXED:
> > +               ret = sysfs_emit(buffer, "relaxed\n");
> > +               break;
> > +       default:
> > +               ret = sysfs_emit(buffer, "<unknown>\n");
> > +               break;
> > +       }
> > +
> > +       return ret;
> > +}


* Re: [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode
  2022-02-04 19:01   ` Dan Williams
@ 2022-03-02  2:00     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-02  2:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Hansen, H. Peter Anvin, Fenghua Yu, Rick Edgecombe,
	Linux Kernel Mailing List

On Fri, Feb 04, 2022 at 11:01:55AM -0800, Dan Williams wrote:
> On Thu, Jan 27, 2022 at 9:55 AM <ira.weiny@intel.com> wrote:
> >
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > Some systems may be using pmem in unanticipated ways.  As such, it is
> > possible an foreseen code path to violate the restrictions of the PMEM
> > PKS protections.
> 
> These sentences do not parse for me. How about:
> 
> "When PKS protections for PMEM are enabled the kernel may capture
> stray writes, or it may capture false positive access violations. An
> example of a false positive access violation is a code path that
> neglects to call kmap_{atomic,local_page}, but is otherwise a valid
> access. In the false positive scenario there is no actual risk to data
> integrity, but the kernel still needs to make a decision as to whether
> to report the access violation and continue, or treat the violation as
> fatal. That policy decision is captured in a new pks_fault_mode kernel
> parameter."

That sounds good, added thanks.

> 
> >
> > In order to provide a more seamless integration of the PMEM PKS feature
> 
> Not sure what "seamless integration" means in this context?

Integration of the stray write protections into production kernels.  This will
help make that integration seamless by easing the restrictions on any
potentially valid users.

I've removed this paragraph though.

> 
> > provide a pks_fault_mode that allows for a relaxed mode should a
> > previously working feature fault on the PKS protected PMEM.
> >
> > 2 modes are available:
> >
> >         'relaxed' (default) -- WARN_ONCE, removed the protections, and
> >         continuing to operate.
> >
> >         'strict' -- BUG_ON/or fault indicating the error.  This is the
> >         most protective of the PMEM memory but may be undesirable in
> >         some configurations.
> >
> > NOTE: The typedef of pks_fault_modes is required to allow
> > param_check_pks_fault() to work automatically for us.  So the typedef
> > checkpatch warning is ignored.
> 
> This doesn't parse for me, why is a typedef needed for a simple
> toggle? Who is "us"?

Missed that 'us'...  ;-)

How about this:

NOTE: The __param_check macro requires a type to correctly verify the
values passed as the module parameter.  Therefore a typedef is made of
the pks_fault_modes and the checkpatch warning regarding new typedefs is
ignored.


> 
> >
> > NOTE: There was some debate about if a 3rd mode called 'silent' should
> > be available.  'silent' would be the same as 'relaxed' but not print any
> > output.  While 'silent' is nice for admins to reduce console/log output
> > it would result in less motivation to fix invalid access to the
> > protected pmem pages.  Therefore, 'silent' is left out.
> >
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> >
> > ---
> > Changes for V8
> >         Use pks_update_exception() instead of abandoning the pkey.
> >         Split out pgmap_protection_flag_invalid() into a separate patch
> >                 for clarity.
> >         From Rick Edgecombe
> >                 Fix sysfs_streq() checks
> >         From Randy Dunlap
> >                 Fix Documentation closing parans
> >
> > Changes for V7
> >         Leverage Rick Edgecombe's fault callback infrastructure to relax invalid
> >                 uses and prevent crashes
> >         From Dan Williams
> >                 Use sysfs_* calls for parameter
> >                 Make pgmap_disable_protection inline
> >                 Remove pfn from warn output
> >         Remove silent parameter option
> > ---
> >  .../admin-guide/kernel-parameters.txt         | 14 ++++
> >  arch/x86/mm/pkeys.c                           |  4 ++
> >  include/linux/mm.h                            |  3 +
> >  mm/memremap.c                                 | 67 +++++++++++++++++++
> >  4 files changed, 88 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index f5a27f067db9..3e70a6194831 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4158,6 +4158,20 @@
> >         pirq=           [SMP,APIC] Manual mp-table setup
> >                         See Documentation/x86/i386/IO-APIC.rst.
> >
> > +       memremap.pks_fault_mode=        [X86] Control the behavior of page map
> > +                       protection violations.  Violations may not be an actual
> > +                       use of the memory but simply an attempt to map it in an
> > +                       incompatible way.
> > +                       (depends on CONFIG_DEVMAP_ACCESS_PROTECTION)
> > +
> > +                       Format: { relaxed | strict }
> > +
> > +                       relaxed - Print a warning, disable the protection and
> > +                                 continue execution.
> > +                       strict - Stop kernel execution via BUG_ON or fault
> > +
> > +                       default: relaxed
> > +
> >         plip=           [PPT,NET] Parallel port network link
> >                         Format: { parport<nr> | timid | 0 }
> >                         See also Documentation/admin-guide/parport.rst.
> > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> > index fa71037c1dd0..e864a9b7828a 100644
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/debugfs.h>             /* debugfs_create_u32()         */
> >  #include <linux/mm_types.h>             /* mm_struct, vma, etc...       */
> >  #include <linux/pkeys.h>                /* PKEY_*                       */
> > +#include <linux/mm.h>                   /* fault callback               */
> >  #include <uapi/asm-generic/mman-common.h>
> >
> >  #include <asm/cpufeature.h>             /* boot_cpu_has, ...            */
> > @@ -243,6 +244,9 @@ static const pks_key_callback pks_key_callbacks[PKS_KEY_NR_CONSUMERS] = {
> >  #ifdef CONFIG_PKS_TEST
> >         [PKS_KEY_TEST]          = pks_test_fault_callback,
> >  #endif
> > +#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
> > +       [PKS_KEY_PGMAP_PROTECTION]   = pgmap_pks_fault_callback,
> > +#endif
> >  };
> >
> >  static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60044de77c54..e900df563437 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1193,6 +1193,9 @@ static inline void pgmap_mk_noaccess(struct page *page)
> >
> >  bool pgmap_protection_available(void);
> >
> > +bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
> > +                             bool write);
> > +
> >  #else
> >
> >  static inline void __pgmap_mk_readwrite(struct dev_pagemap *pgmap) { }
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index b75c4f778c59..783b1cd4bb42 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -96,6 +96,73 @@ static void devmap_protection_disable(void)
> >         static_branch_dec(&dev_pgmap_protection_static_key);
> >  }
> >
> > +/*
> > + * Ignore the checkpatch warning because the typedef allows
> 
> Why document forever in perpetuity to ignore a checkpatch warning for
> something that is no longer a patch once it is upstream?

Checkpatch can be run on files.  I can remove the comment and people can just
look at the commit message.  I was just trying to make it clear why the typedef
is required despite an apparent desire to not grow typedefs in the kernel.

> 
> > + * param_check_pks_fault_modes to automatically check the passed value.
> > + */
> > +typedef enum {
> > +       PKS_MODE_STRICT  = 0,
> > +       PKS_MODE_RELAXED = 1,
> > +} pks_fault_modes;
> > +
> > +pks_fault_modes pks_fault_mode = PKS_MODE_RELAXED;
> > +
> > +static int param_set_pks_fault_mode(const char *val, const struct kernel_param *kp)
> > +{
> > +       int ret = -EINVAL;
> > +
> > +       if (sysfs_streq(val, "relaxed")) {
> > +               pks_fault_mode = PKS_MODE_RELAXED;
> > +               ret = 0;
> > +       } else if (sysfs_streq(val, "strict")) {
> > +               pks_fault_mode = PKS_MODE_STRICT;
> > +               ret = 0;
> > +       }
> > +
> > +       return ret;
> > +}
> > +
> > +static int param_get_pks_fault_mode(char *buffer, const struct kernel_param *kp)
> > +{
> > +       int ret = 0;
> > +
> > +       switch (pks_fault_mode) {
> > +       case PKS_MODE_STRICT:
> > +               ret = sysfs_emit(buffer, "strict\n");
> > +               break;
> > +       case PKS_MODE_RELAXED:
> > +               ret = sysfs_emit(buffer, "relaxed\n");
> > +               break;
> > +       default:
> > +               ret = sysfs_emit(buffer, "<unknown>\n");
> > +               break;
> > +       }
> > +
> > +       return ret;
> > +}
> > +
> > +static const struct kernel_param_ops param_ops_pks_fault_modes = {
> > +       .set = param_set_pks_fault_mode,
> > +       .get = param_get_pks_fault_mode,
> > +};
> > +
> > +#define param_check_pks_fault_modes(name, p) \
> > +       __param_check(name, p, pks_fault_modes)
> > +module_param(pks_fault_mode, pks_fault_modes, 0644);
> 
> Is the complexity to change this at runtime necessary? It seems
> sufficient to make this read-only via sysfs and only rely on command
> line toggles to override the default policy.

I don't understand the complexity?

Ira

> 
> > +
> > +bool pgmap_pks_fault_callback(struct pt_regs *regs, unsigned long address,
> > +                             bool write)
> > +{
> > +       /* In strict mode just let the fault handler oops */
> > +       if (pks_fault_mode == PKS_MODE_STRICT)
> > +               return false;
> > +
> > +       WARN_ONCE(1, "Page map protection being disabled");
> > +       pks_update_exception(regs, PKS_KEY_PGMAP_PROTECTION, 0);
> > +       return true;
> > +}
> > +EXPORT_SYMBOL_GPL(pgmap_pks_fault_callback);
> > +
> >  void __pgmap_mk_readwrite(struct dev_pagemap *pgmap)
> >  {
> >         if (!current->pgmap_prot_count++)
> > --
> > 2.31.1
> >


* Re: [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid()
  2022-02-01  1:37   ` Edgecombe, Rick P
@ 2022-03-02  2:01     ` Ira Weiny
  0 siblings, 0 replies; 145+ messages in thread
From: Ira Weiny @ 2022-03-02  2:01 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: hpa, Williams, Dan J, dave.hansen, Yu, Fenghua, linux-kernel

On Mon, Jan 31, 2022 at 05:37:17PM -0800, Edgecombe, Rick P wrote:
> On Thu, 2022-01-27 at 09:55 -0800, ira.weiny@intel.com wrote:
> > +/*
> > + * pgmap_protection_flag_invalid - Check and flag an invalid use of
> > a pgmap
> > + *                                 protected page
> > + *
> > + * There are code paths which are known to not be compatible with
> > pgmap
> > + * protections.
> 
> This could get hopefully get stale. Maybe the comment should just
> describe what the function does and leave this reasoning to the commit
> log?

Thanks for the review but based on the thread with Dan this patch is dropped.

Thanks,
Ira

> 
> > pgmap_protection_flag_invalid() is provided as a 'relief
> > + * valve' to be used in those functions which are known to be
> > incompatible.
> > + *
> > + * Thus an invalid use case can be flaged with more precise data
> > rather than
> > + * just flagging a fault.  Like the fault handler code this abandons
> 
> In the commit log you called this "the invalid access on fault" and it
> seemed a little clearer to me then "just flagging a fault".
> 
> > the use of
> > + * the PKS key and optionally allows the calling code path to
> > continue based on
> > + * the configuration of the memremap.pks_fault_mode command line
> > + * (and/or sysfs) option.
> 
> It lets the calling code continue regardless right? It just warns if
> !PKS_MODE_STRICT. Why not warn in the case of PKS_MODE_STRICT too?
> 
> Seems surprising that the stricter setting would have less checks.
> 
> > + */
> > +static inline void pgmap_protection_flag_invalid(struct page *page)
> > +{
> > +       if (!pgmap_check_pgmap_prot(page))
> > +               return;
> > +       __pgmap_protection_flag_invalid(page->pgmap);
> > +}


end of thread, other threads:[~2022-03-02  2:01 UTC | newest]

Thread overview: 145+ messages
2022-01-27 17:54 [PATCH V8 00/44] PKS/PMEM: Add Stray Write Protection ira.weiny
2022-01-27 17:54 ` [PATCH V8 01/44] entry: Create an internal irqentry_exit_cond_resched() call ira.weiny
2022-01-27 17:54 ` [PATCH V8 02/44] Documentation/protection-keys: Clean up documentation for User Space pkeys ira.weiny
2022-01-28 22:39   ` Dave Hansen
2022-02-01 23:49     ` Ira Weiny
2022-02-01 23:54       ` Dave Hansen
2022-01-27 17:54 ` [PATCH V8 03/44] x86/pkeys: Create pkeys_common.h ira.weiny
2022-01-28 22:43   ` Dave Hansen
2022-02-02  1:00     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 04/44] x86/pkeys: Add additional PKEY helper macros ira.weiny
2022-01-28 22:47   ` Dave Hansen
2022-02-02 20:21     ` Ira Weiny
2022-02-02 20:26       ` Dave Hansen
2022-02-02 20:28         ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 05/44] x86/fpu: Refactor arch_set_user_pkey_access() ira.weiny
2022-01-28 22:50   ` Dave Hansen
2022-02-02 20:22     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 06/44] mm/pkeys: Add Kconfig options for PKS ira.weiny
2022-01-28 22:54   ` Dave Hansen
2022-01-28 23:10     ` Ira Weiny
2022-01-28 23:51       ` Dave Hansen
2022-02-04 19:08         ` Ira Weiny
2022-02-09  5:34           ` Ira Weiny
2022-02-14 19:20             ` Dave Hansen
2022-02-14 23:03               ` Ira Weiny
2022-01-29  0:06   ` Dave Hansen
2022-02-04 19:14     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 07/44] x86/pkeys: Add PKS CPU feature bit ira.weiny
2022-01-28 23:05   ` Dave Hansen
2022-02-04 19:21     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 08/44] x86/fault: Adjust WARN_ON for PKey fault ira.weiny
2022-01-28 23:10   ` Dave Hansen
2022-02-04 20:06     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 09/44] x86/pkeys: Enable PKS on cpus which support it ira.weiny
2022-01-28 23:18   ` Dave Hansen
2022-01-28 23:41     ` Ira Weiny
2022-01-28 23:53       ` Dave Hansen
2022-01-27 17:54 ` [PATCH V8 10/44] Documentation/pkeys: Add initial PKS documentation ira.weiny
2022-01-28 23:57   ` Dave Hansen
2022-01-27 17:54 ` [PATCH V8 11/44] mm/pkeys: Define static PKS key array and default values ira.weiny
2022-01-29  0:02   ` Dave Hansen
2022-02-04 23:54     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 12/44] mm/pkeys: Define PKS page table macros ira.weiny
2022-01-27 17:54 ` [PATCH V8 13/44] mm/pkeys: Add initial PKS Test code ira.weiny
2022-01-31 19:30   ` Edgecombe, Rick P
2022-02-09 23:44     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 14/44] x86/pkeys: Introduce pks_write_pkrs() ira.weiny
2022-01-29  0:12   ` Dave Hansen
2022-01-29  0:16     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 15/44] x86/pkeys: Preserve the PKS MSR on context switch ira.weiny
2022-01-29  0:22   ` Dave Hansen
2022-02-11  6:10     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 16/44] mm/pkeys: Introduce pks_mk_readwrite() ira.weiny
2022-01-31 23:10   ` Edgecombe, Rick P
2022-02-18  2:22     ` Ira Weiny
2022-02-01 17:40   ` Dave Hansen
2022-02-18  4:39     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 17/44] mm/pkeys: Introduce pks_mk_noaccess() ira.weiny
2022-01-27 17:54 ` [PATCH V8 18/44] x86/fault: Add a PKS test fault hook ira.weiny
2022-01-31 19:56   ` Edgecombe, Rick P
2022-02-11 20:40     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 19/44] mm/pkeys: PKS Testing, add pks_mk_*() tests ira.weiny
2022-02-01 17:45   ` Dave Hansen
2022-02-18  5:34     ` Ira Weiny
2022-02-18 15:28       ` Dave Hansen
2022-02-18 17:25         ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 20/44] mm/pkeys: Add PKS test for context switching ira.weiny
2022-02-01 17:43   ` Edgecombe, Rick P
2022-02-22 21:42     ` Ira Weiny
2022-02-01 17:47   ` Edgecombe, Rick P
2022-02-01 19:52     ` Edgecombe, Rick P
2022-02-18  6:03       ` Ira Weiny
2022-02-18  6:02     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 21/44] x86/entry: Add auxiliary pt_regs space ira.weiny
2022-01-27 17:54 ` [PATCH V8 22/44] entry: Pass pt_regs to irqentry_exit_cond_resched() ira.weiny
2022-01-27 17:54 ` [PATCH V8 23/44] entry: Add architecture auxiliary pt_regs save/restore calls ira.weiny
2022-01-27 17:54 ` [PATCH V8 24/44] x86/entry: Define arch_{save|restore}_auxiliary_pt_regs() ira.weiny
2022-01-27 17:54 ` [PATCH V8 25/44] x86/pkeys: Preserve PKRS MSR across exceptions ira.weiny
2022-01-27 17:54 ` [PATCH V8 26/44] x86/fault: Print PKS MSR on fault ira.weiny
2022-02-01 18:13   ` Edgecombe, Rick P
2022-02-18  6:01     ` Ira Weiny
2022-02-18 17:28       ` Edgecombe, Rick P
2022-02-18 20:20         ` Dave Hansen
2022-02-18 20:54           ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 27/44] mm/pkeys: Add PKS exception test ira.weiny
2022-01-27 17:54 ` [PATCH V8 28/44] mm/pkeys: Introduce pks_update_exception() ira.weiny
2022-01-27 17:54 ` [PATCH V8 29/44] mm/pkeys: Introduce PKS fault callbacks ira.weiny
2022-01-27 17:54 ` [PATCH V8 30/44] mm/pkeys: Test setting a PKS key in a custom fault callback ira.weiny
2022-02-01  0:55   ` Edgecombe, Rick P
2022-03-01 15:39     ` Ira Weiny
2022-02-01 17:42   ` Edgecombe, Rick P
2022-02-11 20:44     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 31/44] mm/pkeys: Add pks_available() ira.weiny
2022-01-27 17:54 ` [PATCH V8 32/44] memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION ira.weiny
2022-02-04 15:49   ` Dan Williams
2022-01-27 17:54 ` [PATCH V8 33/44] memremap_pages: Introduce pgmap_protection_available() ira.weiny
2022-02-04 16:19   ` Dan Williams
2022-02-28 16:59     ` Ira Weiny
2022-03-01 15:56       ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 34/44] memremap_pages: Introduce a PGMAP_PROTECTION flag ira.weiny
2022-01-27 17:54 ` [PATCH V8 35/44] memremap_pages: Introduce devmap_protected() ira.weiny
2022-01-27 17:54 ` [PATCH V8 36/44] memremap_pages: Reserve a PKS PKey for eventual use by PMEM ira.weiny
2022-02-01 18:35   ` Edgecombe, Rick P
2022-02-04 17:12     ` Dan Williams
2022-02-05  5:40       ` Ira Weiny
2022-02-05  8:19         ` Dan Williams
2022-02-06 18:14           ` Dan Williams
2022-02-08 22:48           ` Ira Weiny
2022-02-08 23:22             ` Dan Williams
2022-02-08 23:42               ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 37/44] memremap_pages: Set PKS PKey in PTEs if PGMAP_PROTECTIONS is requested ira.weiny
2022-02-04 17:41   ` Dan Williams
2022-03-01 18:15     ` Ira Weiny
2022-01-27 17:54 ` [PATCH V8 38/44] memremap_pages: Define pgmap_mk_{readwrite|noaccess}() calls ira.weiny
2022-02-04 18:35   ` Dan Williams
2022-02-05  0:09     ` Ira Weiny
2022-02-05  0:19       ` Dan Williams
2022-02-05  0:25         ` Dan Williams
2022-02-05  0:27           ` Dan Williams
2022-02-05  5:55             ` Ira Weiny
2022-02-05  6:28               ` Dan Williams
2022-02-22 22:05     ` Ira Weiny
2022-01-27 17:55 ` [PATCH V8 39/44] memremap_pages: Add memremap.pks_fault_mode ira.weiny
2022-02-01  1:16   ` Edgecombe, Rick P
2022-03-02  0:20     ` Ira Weiny
2022-02-04 19:01   ` Dan Williams
2022-03-02  2:00     ` Ira Weiny
2022-01-27 17:55 ` [PATCH V8 40/44] memremap_pages: Add pgmap_protection_flag_invalid() ira.weiny
2022-02-01  1:37   ` Edgecombe, Rick P
2022-03-02  2:01     ` Ira Weiny
2022-02-04 19:18   ` Dan Williams
2022-01-27 17:55 ` [PATCH V8 41/44] kmap: Ensure kmap works for devmap pages ira.weiny
2022-02-04 21:07   ` Dan Williams
2022-03-01 19:45     ` Ira Weiny
2022-03-01 19:50       ` Ira Weiny
2022-03-01 20:05       ` Dan Williams
2022-03-01 23:03         ` Ira Weiny
2022-01-27 17:55 ` [PATCH V8 42/44] dax: Stray access protection for dax_direct_access() ira.weiny
2022-02-04  5:19   ` Dan Williams
2022-03-01 18:13     ` Ira Weiny
2022-01-27 17:55 ` [PATCH V8 43/44] nvdimm/pmem: Enable stray access protection ira.weiny
2022-02-04 21:10   ` Dan Williams
2022-03-01 18:18     ` Ira Weiny
2022-01-27 17:55 ` [PATCH V8 44/44] devdax: " ira.weiny
2022-02-04 21:12   ` Dan Williams
