* [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology
@ 2015-03-12 23:16 Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection Vikas Shivappa
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch series adds a new cgroup subsystem to support the new Cache
Allocation Technology (CAT) feature found in future Intel Xeon processors.
CAT is part of Resource Director Technology (RDT), also called Platform
Shared Resource Control, which provides support for controlling platform
shared resources such as cache.
More information can be found in the *Intel SDM Volume 3 section 17.15*.

This patch series is *dependent* on the V5 patches for Intel Cache QOS
Monitoring from Matt, since the series also implements a common
software cache for the IA32_PQR_MSR:
https://lkml.kernel.org/r/1422038748-21397-1-git-send-email-matt@codeblueprint.co.uk

*All the patches will apply on 4.0-rc3*.

I have added a bit of code that had been left out of this series. The h/w
provides per-package CLOSIDs, but the OS treats them as global to
simplify the handling. When a cache bitmask is changed, the change
therefore needs to be propagated to all the packages. The CLOSID
update to IA32_PQR_MSR is already done on a per-CPU basis.
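
The propagation step can be sketched in plain user-space C. This is only
an illustration under stated assumptions: MAX_PKGS, pkg_of_cpu and
pick_one_cpu_per_package() are hypothetical names, and the input array
stands in for what the kernel gets from topology_physical_package_id().
The idea is to pick one online CPU per package and run the mask update
there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PKGS 64	/* hypothetical upper bound for this sketch */

/*
 * Pick the first CPU seen in each package.
 * pkg_of_cpu[i] is the package id of CPU i (assumed to be < MAX_PKGS).
 * Chosen CPU numbers are stored in out[] (at most out_len of them);
 * the number of CPUs stored is returned.
 */
static size_t pick_one_cpu_per_package(const int *pkg_of_cpu, size_t ncpus,
				       int *out, size_t out_len)
{
	bool seen[MAX_PKGS] = { false };
	size_t n = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		int pkg = pkg_of_cpu[cpu];

		assert(pkg >= 0 && pkg < MAX_PKGS);
		if (seen[pkg])
			continue;
		seen[pkg] = true;
		if (n < out_len)
			out[n++] = (int)cpu;
	}
	return n;
}
```

Tracking the set of packages already seen, rather than only the previous
package id, keeps the sketch correct even when the CPUs of one package
are not numbered consecutively.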

Changes in V5:
- Added support to propagate the cache bit mask update for each
  package.
- Removed the cache bit mask reference in the intel_rdt structure as
  we already maintain a separate closid<->cbm mapping.
- Made a few coding convention changes and added an
  assertion for cgroup count while freeing the CLOSID.

Changes in V4:
- Integrated with the latest V5 CMT patches.
- Changed the naming of the cgroup to rdt (resource director technology)
  from cat (cache allocation technology). This was done because RDT is
  the umbrella term for platform shared resource allocation, so in
  future it will be easier to add other resource allocation to the same
  cgroup.
- Naming changes also applied to a lot of other data structures/APIs.
- Added documentation on cgroup usage for cache allocation to address
  a lot of questions from various academic and industry users regarding
  cache allocation usage.

Changes in V3:
- Implements a common software cache for IA32_PQR_MSR.
- Implements support for hsw CAT enumeration. This does not use the
  brand strings like the earlier version but does a probe test. The
  probe test is done only on the hsw family of processors.
- Made a few coding convention and name changes.
- Check for the lock being held when CLOSid manipulation happens.

Changes in V2:
- Removed HSW specific enumeration changes. Plan to include it later as a
  separate patch.  
- Fixed the code in prep_arch_switch to be specific for x86 and removed
  x86 defines.
- Fixed cbm_write to not write all 1s when a cgroup is freed.
- Fixed one possible memory leak in init.  
- Changed some of the manual bitmap manipulation to use the predefined
  bitmap APIs to make the code more readable.
- Changed name in sources from cqe to cat
- Global cat enable flag changed to static_key and disabled cgroup early_init
      


* [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch adds support for the new Cache Allocation Technology (CAT)
feature found in future Intel Xeon processors. CAT is part of Intel
Resource Director Technology (RDT), which enables sharing of processor
resources. This patch adds CPUID enumeration routines for CAT and new
fields in the cpuinfo_x86 structure to track CAT resources.
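
As a rough illustration of the enumeration, the sketch below decodes the
register values returned by CPUID leaf 0x10, sub-leaf 1. It does not
execute CPUID itself; the caller passes in EAX and EDX. The struct and
function names are hypothetical, and the field positions (EAX[4:0] for
the mask length minus one, EDX[15:0] for the highest CLOSid) follow my
reading of the SDM. Masking the fields is a choice made in this sketch;
the diff below uses the raw register values.

```c
#include <assert.h>
#include <stdint.h>

struct cat_l3_info {		/* hypothetical container for this sketch */
	unsigned int cbm_len;		/* number of bits in a cache bit mask */
	unsigned int num_closids;	/* number of classes of service */
};

/*
 * Decode CPUID.(EAX=0x10, ECX=1) register values.
 * EAX[4:0]  = CBM length minus one
 * EDX[15:0] = highest CLOSid supported
 */
static struct cat_l3_info cat_l3_decode(uint32_t eax, uint32_t edx)
{
	struct cat_l3_info info;

	info.cbm_len = (eax & 0x1f) + 1;
	info.num_closids = (edx & 0xffff) + 1;
	return info;
}

/* Mask with the low cbm_len bits set, e.g. the root group's full mask. */
static uint32_t cat_full_cbm(unsigned int cbm_len)
{
	assert(cbm_len >= 1 && cbm_len <= 32);
	/* Shift in 64 bits so that cbm_len == 32 is well defined. */
	return (uint32_t)((UINT64_C(1) << cbm_len) - 1);
}
```

For example, EAX = 0x13 and EDX = 0x0f would describe a 20-bit mask and
16 classes of service.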

Cache Allocation Technology (CAT) provides a way for software (OS/VMM)
to restrict cache allocation to a defined 'subset' of the cache, which
may overlap with other 'subsets'. The restriction applies when a line
is allocated in the cache, i.e. when new data is pulled into the cache.
The h/w is programmed via MSRs.

More information about CAT can be found in the Intel (R) x86 Architecture
Software Developer Manual, section 17.15.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h |  6 ++++-
 arch/x86/include/asm/processor.h  |  3 +++
 arch/x86/kernel/cpu/Makefile      |  1 +
 arch/x86/kernel/cpu/common.c      | 15 ++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c   | 51 +++++++++++++++++++++++++++++++++++++++
 init/Kconfig                      | 11 +++++++++
 6 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/cpu/intel_rdt.c

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 361922d..d97b7cd 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -12,7 +12,7 @@
 #include <asm/disabled-features.h>
 #endif
 
-#define NCAPINTS	13	/* N 32-bit words worth of info */
+#define NCAPINTS	14	/* N 32-bit words worth of info */
 #define NBUGINTS	1	/* N 32-bit bug flags */
 
 /*
@@ -228,6 +228,7 @@
 #define X86_FEATURE_RTM		( 9*32+11) /* Restricted Transactional Memory */
 #define X86_FEATURE_CQM		( 9*32+12) /* Cache QoS Monitoring */
 #define X86_FEATURE_MPX		( 9*32+14) /* Memory Protection Extension */
+#define X86_FEATURE_RDT		( 9*32+15) /* Resource Allocation */
 #define X86_FEATURE_AVX512F	( 9*32+16) /* AVX-512 Foundation */
 #define X86_FEATURE_RDSEED	( 9*32+18) /* The RDSEED instruction */
 #define X86_FEATURE_ADX		( 9*32+19) /* The ADCX and ADOX instructions */
@@ -249,6 +250,9 @@
 /* Intel-defined CPU QoS Sub-leaf, CPUID level 0x0000000F:1 (edx), word 12 */
 #define X86_FEATURE_CQM_OCCUP_LLC (12*32+ 0) /* LLC occupancy monitoring if 1 */
 
+/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 13 */
+#define X86_FEATURE_CAT_L3	(13*32 + 1) /* Cache QOS Enforcement L3 */
+
 /*
  * BUG word(s)
  */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index a12d50e..ad96bdd 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -114,6 +114,9 @@ struct cpuinfo_x86 {
 	int			x86_cache_occ_scale;	/* scale to bytes */
 	int			x86_power;
 	unsigned long		loops_per_jiffy;
+	/* Cache Allocation Technology values */
+	u16			x86_cat_cbmlength;
+	u16			x86_cat_closs;
 	/* cpuid returned max cores value: */
 	u16			 x86_max_cores;
 	u16			apicid;
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 6c1ca13..eda32ff 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)	+= perf_event_intel_uncore.o \
 					   perf_event_intel_uncore_nhmex.o
 endif
 
+obj-$(CONFIG_CGROUP_RDT) 		+= intel_rdt.o
 
 obj-$(CONFIG_X86_MCE)			+= mcheck/
 obj-$(CONFIG_MTRR)			+= mtrr/
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1cd4a1a..1d70385 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -670,6 +670,21 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
 		}
 	}
 
+	/* Additional Intel-defined flags: level 0x00000010 */
+	if (c->cpuid_level >= 0x00000010) {
+		u32 eax, ebx, ecx, edx;
+
+		cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
+		c->x86_capability[13] = ebx;
+
+		if (cpu_has(c, X86_FEATURE_CAT_L3)) {
+
+			cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
+			c->x86_cat_closs = edx + 1;
+			c->x86_cat_cbmlength = eax + 1;
+		}
+	}
+
 	/* AMD-defined flags: level 0x80000001 */
 	xlvl = cpuid_eax(0x80000000);
 	c->extended_cpuid_level = xlvl;
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
new file mode 100644
index 0000000..46ce449
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -0,0 +1,51 @@
+/*
+ * Resource Director Technology(RDT) code
+ *
+ * Copyright (C) 2014 Intel Corporation
+ *
+ * 2014-09-10 Written by Vikas Shivappa
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual, section 17.15.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/spinlock.h>
+
+static inline bool rdt_supported(struct cpuinfo_x86 *c)
+{
+	if (cpu_has(c, X86_FEATURE_RDT))
+		return true;
+
+	return false;
+}
+
+static int __init rdt_late_init(void)
+{
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+	int maxid, cbm_len;
+
+	if (!rdt_supported(c))
+		return -ENODEV;
+
+	maxid = c->x86_cat_closs;
+	cbm_len = c->x86_cat_cbmlength;
+
+	pr_info("cbmlength:%u,Closs: %u\n", cbm_len, maxid);
+
+	return 0;
+}
+
+late_initcall(rdt_late_init);
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..d8b5a19 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -968,6 +968,17 @@ config CPUSETS
 
 	  Say N if unsure.
 
+config CGROUP_RDT
+	bool "Resource Director Technology cgroup subsystem"
+	depends on X86_64 && CPU_SUP_INTEL
+	help
+	  This option provides a cgroup to allocate Platform shared
+	  resources. Among the shared resources, current implementation
+	  focuses on L3 Cache. Using the interface user can specify the
+	  amount of L3 cache space into which an application can fill.
+
+	  Say N if unsure.
+
 config PROC_PID_CPUSET
 	bool "Include legacy /proc/<pid>/cpuset file"
 	depends on CPUSETS
-- 
1.9.1



* [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT Vikas Shivappa
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch adds a cgroup subsystem to support Intel Resource Director
Technology (RDT), or Platform Shared Resource Control. The only resource
currently supported for sharing is the last-level cache
(Cache Allocation Technology, or CAT).
When an RDT cgroup is created, it has a CLOSid and a CBM associated with
it, which are inherited from its parent. A Class of Service (CLOS) in
Cache Allocation is represented by a CLOSid. The CLOSid is internal to
the kernel and not exposed to the user. A cache bitmask (CBM) represents
one cache 'subset'. The root cgroup has all available bits set in its
CBM and is assigned CLOSid 0.

CLOSid allocation is tracked using a separate bitmap. The maximum number
of CLOSids is specified by the h/w during CPUID enumeration, and the
kernel simply returns -ENOSPC when it runs out of CLOSids.
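
A minimal user-space sketch of this bitmap bookkeeping follows. The names
closid_pool, closid_alloc() and closid_free() are hypothetical, and the
pool is capped at 64 entries so that a single word can serve as the
bitmap; the kernel code in the diff uses the generic bitmap helpers
instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_CLOSIDS 64	/* hypothetical cap so one word suffices */

struct closid_pool {
	uint64_t used;		/* bit i set => CLOSid i is allocated */
	unsigned int max;	/* number of CLOSids reported by CPUID */
};

static void closid_pool_init(struct closid_pool *p, unsigned int max)
{
	assert(max >= 1 && max <= SKETCH_MAX_CLOSIDS);
	p->used = 1;		/* CLOSid 0 is reserved for the root cgroup */
	p->max = max;
}

/* Return the lowest free CLOSid, or -ENOSPC when all are taken. */
static int closid_alloc(struct closid_pool *p)
{
	for (unsigned int id = 0; id < p->max; id++) {
		if (!(p->used & (UINT64_C(1) << id))) {
			p->used |= UINT64_C(1) << id;
			return (int)id;
		}
	}
	return -ENOSPC;
}

/* Mark a CLOSid as free again. */
static void closid_free(struct closid_pool *p, unsigned int id)
{
	assert(id < p->max);
	p->used &= ~(UINT64_C(1) << id);
}
```

With four CLOSids, three allocations succeed (ids 1 to 3) and the fourth
returns -ENOSPC until one of them is freed.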

Each CBM has an associated CLOSid. If multiple cgroups have the same CBM,
they also have the same CLOSid. The reference count field in the
CLOSid<->CBM map keeps track of how many cgroups are using each
CLOSid<->CBM mapping.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h |  38 +++++++++++++++
 arch/x86/kernel/cpu/intel_rdt.c  | 101 ++++++++++++++++++++++++++++++++++++---
 include/linux/cgroup_subsys.h    |   4 ++
 3 files changed, 137 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/intel_rdt.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
new file mode 100644
index 0000000..87af1a5
--- /dev/null
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -0,0 +1,38 @@
+#ifndef _RDT_H_
+#define _RDT_H_
+
+#ifdef CONFIG_CGROUP_RDT
+
+#include <linux/cgroup.h>
+
+struct rdt_subsys_info {
+	/* Clos Bitmap to keep track of available CLOSids.*/
+	unsigned long *closmap;
+};
+
+struct intel_rdt {
+	struct cgroup_subsys_state css;
+	/* Class of service for the cgroup.*/
+	unsigned int clos;
+};
+
+struct clos_cbm_map {
+	unsigned long cbm;
+	unsigned int cgrp_count;
+};
+
+/*
+ * Return rdt group corresponding to this container.
+ */
+static inline struct intel_rdt *css_rdt(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct intel_rdt, css) : NULL;
+}
+
+static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
+{
+	return css_rdt(ir->css.parent);
+}
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 46ce449..3726f41 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -23,10 +23,19 @@
 #include <linux/slab.h>
 #include <linux/err.h>
 #include <linux/spinlock.h>
+#include <asm/intel_rdt.h>
 
-static inline bool rdt_supported(struct cpuinfo_x86 *c)
+/*
+ * ccmap maintains 1:1 mapping between CLOSid and cbm.
+ */
+static struct clos_cbm_map *ccmap;
+static struct rdt_subsys_info rdtss_info;
+static DEFINE_MUTEX(rdt_group_mutex);
+struct intel_rdt rdt_root_group;
+
+static inline bool cat_supported(struct cpuinfo_x86 *c)
 {
-	if (cpu_has(c, X86_FEATURE_RDT))
+	if (cpu_has(c, X86_FEATURE_CAT_L3))
 		return true;
 
 	return false;
@@ -35,17 +44,97 @@ static inline bool rdt_supported(struct cpuinfo_x86 *c)
 static int __init rdt_late_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
+	static struct clos_cbm_map *ccm;
+	size_t sizeb;
 	int maxid, cbm_len;
 
-	if (!rdt_supported(c))
+	if (!cat_supported(c)) {
+		rdt_root_group.css.ss->disabled = 1;
 		return -ENODEV;
+	} else {
+		maxid = c->x86_cat_closs;
+		cbm_len = c->x86_cat_cbmlength;
+		sizeb = BITS_TO_LONGS(maxid) * sizeof(long);
+
+		rdtss_info.closmap = kzalloc(sizeb, GFP_KERNEL);
+		if (!rdtss_info.closmap)
+			return -ENOMEM;
 
-	maxid = c->x86_cat_closs;
-	cbm_len = c->x86_cat_cbmlength;
+		sizeb = maxid * sizeof(struct clos_cbm_map);
+		ccmap = kzalloc(sizeb, GFP_KERNEL);
+		if (!ccmap) {
+			kfree(rdtss_info.closmap);
+			return -ENOMEM;
+		}
 
-	pr_info("cbmlength:%u,Closs: %u\n", cbm_len, maxid);
+		set_bit(0, rdtss_info.closmap);
+		rdt_root_group.clos = 0;
+
+		ccm = &ccmap[0];
+		ccm->cbm = (u32)((u64)(1 << cbm_len) - 1);
+		ccm->cgrp_count++;
+
+		pr_info("cbmlength:%u,Closs: %u\n", cbm_len, maxid);
+	}
 
 	return 0;
 }
 
 late_initcall(rdt_late_init);
+
+/*
+* Called with the rdt_group_mutex held.
+*/
+static int rdt_free_closid(struct intel_rdt *ir)
+{
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	WARN_ON(!ccmap[ir->clos].cgrp_count);
+	ccmap[ir->clos].cgrp_count--;
+	if (!ccmap[ir->clos].cgrp_count)
+		clear_bit(ir->clos, rdtss_info.closmap);
+
+	return 0;
+}
+
+static struct cgroup_subsys_state *
+rdt_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct intel_rdt *parent = css_rdt(parent_css);
+	struct intel_rdt *ir;
+
+	/*
+	 * Cannot return failure on systems with no Cache Allocation
+	 * as the cgroup_init does not handle failures gracefully.
+	 */
+	if (!parent)
+		return &rdt_root_group.css;
+
+	ir = kzalloc(sizeof(struct intel_rdt), GFP_KERNEL);
+	if (!ir)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&rdt_group_mutex);
+	ir->clos = parent->clos;
+	ccmap[parent->clos].cgrp_count++;
+	mutex_unlock(&rdt_group_mutex);
+
+	return &ir->css;
+}
+
+static void rdt_css_free(struct cgroup_subsys_state *css)
+{
+	struct intel_rdt *ir = css_rdt(css);
+
+	mutex_lock(&rdt_group_mutex);
+	rdt_free_closid(ir);
+	kfree(ir);
+	mutex_unlock(&rdt_group_mutex);
+}
+
+struct cgroup_subsys rdt_cgrp_subsys = {
+	.css_alloc			= rdt_css_alloc,
+	.css_free			= rdt_css_free,
+	.early_init			= 0,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index e4a96fb..81c803d 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -47,6 +47,10 @@ SUBSYS(net_prio)
 SUBSYS(hugetlb)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_RDT)
+SUBSYS(rdt)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
-- 
1.9.1



* [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-04-09 20:56   ` Marcelo Tosatti
  2015-03-12 23:16 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

Add support for cache bit mask manipulation. The change adds a file to
the RDT cgroup which represents the CBM (cache bit mask) for the cgroup.

The RDT cgroup follows the cgroup hierarchy; mkdir and adding tasks to
the cgroup never fail. When a child cgroup is created, it inherits the
CLOSid and the CBM from its parent. When a user changes the default
CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
used before. If the new CBM is one that is already in use, the
count for that CLOSid<->CBM mapping is incremented. Changing 'cbm'
may fail with -ENOSPC once the kernel runs out of the CLOSids it
can support.
Users can create as many cgroups as they want, but the number of
different CBMs in use at the same time is restricted by the maximum
number of CLOSids (multiple cgroups can have the same CBM).
The kernel maintains a CLOSid<->CBM mapping which keeps a count
of the cgroups using each CLOSid.

The tasks in the CAT cgroup get to fill the LLC cache represented
by the cgroup's 'cbm' file.
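
Before a new mask is accepted, the diff checks that it has at least two
bits set, that the set bits are contiguous, and that it respects the
hierarchy (a subset of the parent's mask). A compact user-space sketch of
those checks is shown here; cbm_is_valid() and cbm_is_subset() are
hypothetical names, and the bit trick replaces the bitmap helpers used in
the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if cbm has at least two set bits and they form one unbroken run. */
static bool cbm_is_valid(uint32_t cbm)
{
	/* Isolate the lowest set bit (0 when cbm is 0). */
	uint32_t lowest = cbm & (uint32_t)(~cbm + 1u);

	if (cbm == 0 || cbm == lowest)
		return false;	/* zero bits or exactly one bit set */

	/*
	 * Adding the lowest set bit carries through a contiguous run and
	 * clears it, so no original bit survives the AND. A second,
	 * separate run would be left untouched and make the AND non-zero.
	 */
	return ((uint32_t)(cbm + lowest) & cbm) == 0;
}

/* True if every bit set in child is also set in parent. */
static bool cbm_is_subset(uint32_t child, uint32_t parent)
{
	return (child & ~parent) == 0;
}
```

For instance, 0x6 is accepted, while 0x5 (a gap) and 0x1 (a single bit)
are rejected.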

Reuse of CLOSids for cgroups with the same bitmask also has the following
advantages:
- It helps to use the scarce CLOSids optimally.
- It also implies that during a context switch, the write to the PQR MSR
  is done only when a task with a different bitmask is scheduled in.
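
The reuse scheme can be sketched as a small reference-counted table. This
is an illustration only: NUM_CLOSIDS, clos_entry, clos_get() and
clos_put() are hypothetical names, and the table is a fixed-size array
rather than the ccmap allocated at init time. In this sketch only entries
that are currently in use are considered for sharing.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NUM_CLOSIDS 4	/* hypothetical; the kernel reads this from CPUID */

struct clos_entry {
	uint32_t cbm;		/* cache bit mask programmed for this CLOSid */
	unsigned int refs;	/* number of groups using it; 0 means free */
};

static struct clos_entry table[NUM_CLOSIDS];

/* Drop one reference; the entry becomes reusable when refs reaches 0. */
static void clos_put(unsigned int id)
{
	assert(id < NUM_CLOSIDS && table[id].refs > 0);
	table[id].refs--;
}

/*
 * Return a CLOSid whose mask equals cbm, sharing an in-use entry when
 * one exists and claiming a free entry otherwise.
 * Returns -ENOSPC when every entry is in use with a different mask.
 */
static int clos_get(uint32_t cbm)
{
	int free_id = -1;

	for (int id = 0; id < NUM_CLOSIDS; id++) {
		if (table[id].refs && table[id].cbm == cbm) {
			table[id].refs++;
			return id;
		}
		if (!table[id].refs && free_id < 0)
			free_id = id;
	}
	if (free_id < 0)
		return -ENOSPC;

	/* A real implementation would also program the mask MSR here. */
	table[free_id].cbm = cbm;
	table[free_id].refs = 1;
	return free_id;
}
```

Two groups asking for the same mask receive the same CLOSid, so switching
between their tasks needs no PQR write.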

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h |   3 +
 arch/x86/kernel/cpu/intel_rdt.c  | 205 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 208 insertions(+)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 87af1a5..0ed28d9 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,6 +4,9 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#define MAX_CBM_LENGTH			32
+#define IA32_L3_CBM_BASE		0xc90
+#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
 
 struct rdt_subsys_info {
 	/* Clos Bitmap to keep track of available CLOSids.*/
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 3726f41..495497a 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -33,6 +33,9 @@ static struct rdt_subsys_info rdtss_info;
 static DEFINE_MUTEX(rdt_group_mutex);
 struct intel_rdt rdt_root_group;
 
+#define rdt_for_each_child(pos_css, parent_ir)		\
+	css_for_each_child((pos_css), &(parent_ir)->css)
+
 static inline bool cat_supported(struct cpuinfo_x86 *c)
 {
 	if (cpu_has(c, X86_FEATURE_CAT_L3))
@@ -83,6 +86,31 @@ static int __init rdt_late_init(void)
 late_initcall(rdt_late_init);
 
 /*
+ * Allocates a new closid from unused closids.
+ * Called with the rdt_group_mutex held.
+ */
+
+static int rdt_alloc_closid(struct intel_rdt *ir)
+{
+	unsigned int id;
+	unsigned int maxid;
+
+	lockdep_assert_held(&rdt_group_mutex);
+
+	maxid = boot_cpu_data.x86_cat_closs;
+	id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
+	if (id == maxid)
+		return -ENOSPC;
+
+	set_bit(id, rdtss_info.closmap);
+	WARN_ON(ccmap[id].cgrp_count);
+	ccmap[id].cgrp_count++;
+	ir->clos = id;
+
+	return 0;
+}
+
+/*
 * Called with the rdt_group_mutex held.
 */
 static int rdt_free_closid(struct intel_rdt *ir)
@@ -133,8 +161,185 @@ static void rdt_css_free(struct cgroup_subsys_state *css)
 	mutex_unlock(&rdt_group_mutex);
 }
 
+/*
+ * Tests if at least two contiguous bits are set.
+ */
+
+static inline bool cbm_is_contiguous(unsigned long var)
+{
+	unsigned long first_bit, zero_bit;
+	unsigned long maxcbm = MAX_CBM_LENGTH;
+
+	if (bitmap_weight(&var, maxcbm) < 2)
+		return false;
+
+	first_bit = find_next_bit(&var, maxcbm, 0);
+	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
+
+	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
+		return false;
+
+	return true;
+}
+
+static int cat_cbm_read(struct seq_file *m, void *v)
+{
+	struct intel_rdt *ir = css_rdt(seq_css(m));
+
+	seq_printf(m, "%08lx\n", ccmap[ir->clos].cbm);
+	return 0;
+}
+
+static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
+{
+	struct intel_rdt *par, *c;
+	struct cgroup_subsys_state *css;
+	unsigned long *cbm_tmp;
+
+	if (!cbm_is_contiguous(cbmvalue)) {
+		pr_info("cbm should have >= 2 bits and be contiguous\n");
+		return -EINVAL;
+	}
+
+	par = parent_rdt(ir);
+	cbm_tmp = &ccmap[par->clos].cbm;
+	if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
+		return -EINVAL;
+
+	rcu_read_lock();
+	rdt_for_each_child(css, ir) {
+		c = css_rdt(css);
+		cbm_tmp = &ccmap[c->clos].cbm;
+		if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
+			pr_info("Children's mask not a subset\n");
+			rcu_read_unlock();
+			return -EINVAL;
+		}
+	}
+
+	rcu_read_unlock();
+	return 0;
+}
+
+static bool cbm_search(unsigned long cbm, int *closid)
+{
+	int maxid = boot_cpu_data.x86_cat_closs;
+	unsigned int i;
+
+	for (i = 0; i < maxid; i++)
+		if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
+			*closid = i;
+			return true;
+		}
+
+	return false;
+}
+
+static void cbmmap_dump(void)
+{
+	int i;
+
+	pr_debug("CBMMAP\n");
+	for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
+		pr_debug("cbm: 0x%x,cgrp_count: %u\n",
+		 (unsigned int)ccmap[i].cbm, ccmap[i].cgrp_count);
+}
+
+static void cpu_cbm_update(void *info)
+{
+	unsigned int closid = *((unsigned int *)info);
+
+	wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cbm);
+}
+
+static inline void cbm_update(unsigned int closid)
+{
+	int pkg_id = -1;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (pkg_id == topology_physical_package_id(cpu))
+			continue;
+		smp_call_function_single(cpu, cpu_cbm_update, &closid, 1);
+		pkg_id = topology_physical_package_id(cpu);
+
+	}
+}
+
+/*
+ * cat_cbm_write() - Validates and writes the cache bit mask(cbm)
+ * to the IA32_L3_MASK_n and also store the same in the ccmap.
+ *
+ * CLOSids are reused for cgroups which have same bitmask.
+ * - This helps to use the scant CLOSids optimally.
+ * - This also implies that at context switch write
+ * to PQR-MSR is done only when a task with a
+ * different bitmask is scheduled in.
+ */
+
+static int cat_cbm_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 cbmvalue)
+{
+	struct intel_rdt *ir = css_rdt(css);
+	ssize_t err = 0;
+	unsigned long cbm;
+	unsigned long *cbm_tmp;
+	unsigned int closid;
+	u32 cbm_mask =
+		(u32)((u64)(1 << boot_cpu_data.x86_cat_cbmlength) - 1);
+
+	if (ir == &rdt_root_group)
+		return -EPERM;
+
+	/*
+	* Need global mutex as cbm write may allocate a closid.
+	*/
+	mutex_lock(&rdt_group_mutex);
+	cbm = cbmvalue & cbm_mask;
+	cbm_tmp = &ccmap[ir->clos].cbm;
+
+	if (bitmap_equal(&cbm, cbm_tmp, MAX_CBM_LENGTH))
+		goto out;
+
+	err = validate_cbm(ir, cbm);
+	if (err)
+		goto out;
+
+	rdt_free_closid(ir);
+
+	if (cbm_search(cbm, &closid)) {
+		ir->clos = closid;
+		ccmap[ir->clos].cgrp_count++;
+	} else {
+		err = rdt_alloc_closid(ir);
+		if (err)
+			goto out;
+
+		ccmap[ir->clos].cbm = cbm;
+		cbm_update(ir->clos);
+	}
+
+	cbmmap_dump();
+
+out:
+
+	mutex_unlock(&rdt_group_mutex);
+	return err;
+}
+
+static struct cftype rdt_files[] = {
+	{
+		.name = "cbm",
+		.seq_show = cat_cbm_read,
+		.write_u64 = cat_cbm_write,
+		.mode = 0666,
+	},
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys rdt_cgrp_subsys = {
 	.css_alloc			= rdt_css_alloc,
 	.css_free			= rdt_css_free,
+	.legacy_cftypes			= rdt_files,
 	.early_init			= 0,
 };
-- 
1.9.1



* [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
                   ` (2 preceding siblings ...)
  2015-03-12 23:16 ` [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

Adds support for IA32_PQR_ASSOC MSR writes during task scheduling.

The high 32 bits of the per-processor MSR IA32_PQR_ASSOC represent the
CLOSid. During a context switch, the kernel writes the CLOSid of the
cgroup to which the task belongs to the CPU's IA32_PQR_ASSOC MSR.

For Cache Allocation, this lets the task fill the cache 'subset'
represented by the cgroup's cache bit mask (CBM).
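
The register layout can be illustrated with a small user-space sketch
that only builds and splits the 64-bit value; it does not touch any MSR.
The helper names and PQR_RMID_MASK are hypothetical. The RMID field in
the low 10 bits belongs to cache monitoring, which is why a writer of the
CLOSid should carry the current RMID along rather than writing zero.

```c
#include <assert.h>
#include <stdint.h>

#define PQR_RMID_MASK	0x3ffu	/* RMID occupies bits 0-9 */

/* Build the 64-bit value: CLOSid in bits 32-63, RMID in bits 0-9. */
static uint64_t pqr_assoc_value(uint32_t closid, uint32_t rmid)
{
	return ((uint64_t)closid << 32) | (rmid & PQR_RMID_MASK);
}

/* Split the value into the low and high halves a wrmsr-style call takes. */
static void pqr_assoc_split(uint64_t val, uint32_t *low, uint32_t *high)
{
	assert(low && high);
	*low = (uint32_t)val;
	*high = (uint32_t)(val >> 32);
}
```

For CLOSid 3 and RMID 0x25 the value is 0x0000000300000025, i.e. a low
half of 0x25 and a high half of 3.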

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h | 55 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/switch_to.h |  3 +++
 arch/x86/kernel/cpu/intel_rdt.c  |  4 ++-
 kernel/sched/core.c              |  1 +
 kernel/sched/sched.h             |  3 +++
 5 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 0ed28d9..6383a24 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,9 +4,13 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+
+#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
+DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+extern struct static_key rdt_enable_key;
 
 struct rdt_subsys_info {
 	/* Clos Bitmap to keep track of available CLOSids.*/
@@ -24,6 +28,11 @@ struct clos_cbm_map {
 	unsigned int cgrp_count;
 };
 
+static inline bool rdt_enabled(void)
+{
+	return static_key_false(&rdt_enable_key);
+}
+
 /*
  * Return rdt group corresponding to this container.
  */
@@ -37,5 +46,51 @@ static inline struct intel_rdt *parent_rdt(struct intel_rdt *ir)
 	return css_rdt(ir->css.parent);
 }
 
+/*
+ * Return rdt group to which this task belongs.
+ */
+static inline struct intel_rdt *task_rdt(struct task_struct *task)
+{
+	return css_rdt(task_css(task, rdt_cgrp_id));
+}
+
+/*
+ * rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
+ * if the current Closid is different than the new one.
+ */
+
+static inline void rdt_sched_in(struct task_struct *task)
+{
+	struct intel_rdt *ir;
+	unsigned int clos;
+
+	if (!rdt_enabled())
+		return;
+
+	/*
+	 * This needs to be fixed after CQM code stabilizes
+	 * to cache the whole PQR instead of just CLOSid.
+	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
+	 * Should not write a 0 to the low 10 bits of PQR
+	 * and corrupt RMID.
+	 */
+	clos = this_cpu_read(x86_cpu_clos);
+
+	rcu_read_lock();
+	ir = task_rdt(task);
+	if (ir->clos == clos) {
+		rcu_read_unlock();
+		return;
+	}
+
+	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
+	this_cpu_write(x86_cpu_clos, ir->clos);
+	rcu_read_unlock();
+}
+
+#else
+
+static inline void rdt_sched_in(struct task_struct *task) {}
+
 #endif
 #endif
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 751bf4b..82ef4b3 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -8,6 +8,9 @@ struct tss_struct;
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		      struct tss_struct *tss);
 
+#include <asm/intel_rdt.h>
+#define post_arch_switch(current)	rdt_sched_in(current)
+
 #ifdef CONFIG_X86_32
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 495497a..0330791 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -32,6 +32,8 @@ static struct clos_cbm_map *ccmap;
 static struct rdt_subsys_info rdtss_info;
 static DEFINE_MUTEX(rdt_group_mutex);
 struct intel_rdt rdt_root_group;
+struct static_key __read_mostly rdt_enable_key = STATIC_KEY_INIT_FALSE;
+DEFINE_PER_CPU(unsigned int, x86_cpu_clos);
 
 #define rdt_for_each_child(pos_css, parent_ir)		\
 	css_for_each_child((pos_css), &(parent_ir)->css)
@@ -76,7 +78,7 @@ static int __init rdt_late_init(void)
 		ccm = &ccmap[0];
 		ccm->cbm = (u32)((u64)(1 << cbm_len) - 1);
 		ccm->cgrp_count++;
-
+		static_key_slow_inc(&rdt_enable_key);
 		pr_info("cbmlength:%u,Closs: %u\n", cbm_len, maxid);
 	}
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..93ff61b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2206,6 +2206,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
+	post_arch_switch(current);
 	perf_event_task_sched_in(prev, current);
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc0f435..0b3c191 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1026,6 +1026,9 @@ static inline int task_on_rq_migrating(struct task_struct *p)
 #ifndef finish_arch_switch
 # define finish_arch_switch(prev)	do { } while (0)
 #endif
+#ifndef post_arch_switch
+# define post_arch_switch(current)	do { } while (0)
+#endif
 #ifndef finish_arch_post_lock_switch
 # define finish_arch_post_lock_switch()	do { } while (0)
 #endif
-- 
1.9.1



* [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
                   ` (3 preceding siblings ...)
  2015-03-12 23:16 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide Vikas Shivappa
  6 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch implements a common software cache for IA32_PQR_MSR (RMID in
bits 0:9, CLOSid in bits 32:63) to be used by both CMT and CAT. CMT
updates the RMID, whereas CAT updates the CLOSid in the software cache.
When the new RMID/CLOSid value differs from the cached value,
IA32_PQR_MSR is updated. Since the measured rdmsr latency for
IA32_PQR_MSR is very high (~250 cycles), this software cache is necessary
to avoid reading the MSR to compare the current CLOSid value.
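
As a rough illustration of the caching idea described above, here is a
minimal userspace sketch. It is not the kernel code: the MSR is replaced
by a plain variable, and fake_pqr_msr, fake_wrmsr(), struct pqr_cache,
pqr_set_clos() and pqr_set_rmid() are hypothetical names. It also skips
redundant writes on the RMID side, which is an assumption of the sketch
rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the real MSR; a hypothetical userspace model only. */
static uint64_t fake_pqr_msr;
static unsigned int fake_msr_writes;

/* Software cache of the last values written (kept per CPU in the patch). */
struct pqr_cache {
	uint32_t rmid;	/* goes into the low 10 bits of the MSR */
	uint32_t clos;	/* goes into the high 32 bits of the MSR */
};

/* Models wrmsr(msr, lo, hi): lo is bits 0:31, hi is bits 32:63. */
static void fake_wrmsr(uint32_t lo, uint32_t hi)
{
	fake_pqr_msr = ((uint64_t)hi << 32) | lo;
	fake_msr_writes++;
}

/* CAT side: change the CLOSid, rewriting the cached RMID unchanged. */
static void pqr_set_clos(struct pqr_cache *c, uint32_t clos)
{
	if (c->clos == clos)
		return;		/* cached value matches, no MSR access */
	fake_wrmsr(c->rmid, clos);
	c->clos = clos;
}

/* CMT side: change the RMID, rewriting the cached CLOSid unchanged. */
static void pqr_set_rmid(struct pqr_cache *c, uint32_t rmid)
{
	assert(rmid < (1u << 10));	/* RMID occupies 10 bits */
	if (c->rmid == rmid)
		return;
	fake_wrmsr(rmid, c->clos);
	c->rmid = rmid;
}
```

Neither path ever reads the MSR: the comparison is done against the
cached copy, and each write carries the other field along so that it is
not zeroed.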

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/include/asm/intel_rdt.h           | 31 +++++++++++++++---------------
 arch/x86/include/asm/rdt_common.h          | 13 +++++++++++++
 arch/x86/kernel/cpu/perf_event_intel_cqm.c | 20 +++++++------------
 3 files changed, 36 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/include/asm/rdt_common.h

diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
index 6383a24..5a8139e 100644
--- a/arch/x86/include/asm/intel_rdt.h
+++ b/arch/x86/include/asm/intel_rdt.h
@@ -4,12 +4,13 @@
 #ifdef CONFIG_CGROUP_RDT
 
 #include <linux/cgroup.h>
+#include <asm/rdt_common.h>
 
-#define MSR_IA32_PQR_ASSOC		0xc8f
 #define MAX_CBM_LENGTH			32
 #define IA32_L3_CBM_BASE		0xc90
 #define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
-DECLARE_PER_CPU(unsigned int, x86_cpu_clos);
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
 extern struct static_key rdt_enable_key;
 
 struct rdt_subsys_info {
@@ -62,30 +63,30 @@ static inline struct intel_rdt *task_rdt(struct task_struct *task)
 static inline void rdt_sched_in(struct task_struct *task)
 {
 	struct intel_rdt *ir;
-	unsigned int clos;
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+	unsigned long flags;
 
 	if (!rdt_enabled())
 		return;
 
-	/*
-	 * This needs to be fixed after CQM code stabilizes
-	 * to cache the whole PQR instead of just CLOSid.
-	 * PQR has closid in high 32 bits and CQM-RMID in low 10 bits.
-	 * Should not write a 0 to the low 10 bits of PQR
-	 * and corrupt RMID.
-	 */
-	clos = this_cpu_read(x86_cpu_clos);
-
+	raw_spin_lock_irqsave(&state->lock, flags);
 	rcu_read_lock();
 	ir = task_rdt(task);
-	if (ir->clos == clos) {
+	if (ir->clos == state->clos) {
 		rcu_read_unlock();
+		raw_spin_unlock_irqrestore(&state->lock, flags);
 		return;
 	}
 
-	wrmsr(MSR_IA32_PQR_ASSOC, 0, ir->clos);
-	this_cpu_write(x86_cpu_clos, ir->clos);
+	/*
+	 * PQR has closid in high 32 bits and CQM-RMID
+	 * in low 10 bits. Rewrite the existing rmid from
+	 * software cache.
+	 */
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, ir->clos);
+	state->clos = ir->clos;
 	rcu_read_unlock();
+	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 #else
diff --git a/arch/x86/include/asm/rdt_common.h b/arch/x86/include/asm/rdt_common.h
new file mode 100644
index 0000000..c87f908
--- /dev/null
+++ b/arch/x86/include/asm/rdt_common.h
@@ -0,0 +1,13 @@
+#ifndef _X86_RDT_H_
+#define _X86_RDT_H_
+
+#define MSR_IA32_PQR_ASSOC	0x0c8f
+
+struct intel_pqr_state {
+	raw_spinlock_t	lock;
+	int		rmid;
+	int		clos;
+	int		cnt;
+};
+
+#endif
diff --git a/arch/x86/kernel/cpu/perf_event_intel_cqm.c b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
index 596d1ec..63c52e0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_cqm.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_cqm.c
@@ -7,22 +7,16 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <asm/cpu_device_id.h>
+#include <asm/rdt_common.h>
 #include "perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC	0x0c8f
 #define MSR_IA32_QM_CTR		0x0c8e
 #define MSR_IA32_QM_EVTSEL	0x0c8d
 
 static unsigned int cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-struct intel_cqm_state {
-	raw_spinlock_t		lock;
-	int			rmid;
-	int 			cnt;
-};
-
-static DEFINE_PER_CPU(struct intel_cqm_state, cqm_state);
+DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
@@ -931,7 +925,7 @@ out:
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned int rmid = event->hw.cqm_rmid;
 	unsigned long flags;
 
@@ -948,14 +942,14 @@ static void intel_cqm_event_start(struct perf_event *event, int mode)
 		WARN_ON_ONCE(state->rmid);
 
 	state->rmid = rmid;
-	wrmsrl(MSR_IA32_PQR_ASSOC, state->rmid);
+	wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->clos);
 
 	raw_spin_unlock_irqrestore(&state->lock, flags);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
-	struct intel_cqm_state *state = this_cpu_ptr(&cqm_state);
+	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
 	unsigned long flags;
 
 	if (event->hw.cqm_state & PERF_HES_STOPPED)
@@ -968,7 +962,7 @@ static void intel_cqm_event_stop(struct perf_event *event, int mode)
 
 	if (!--state->cnt) {
 		state->rmid = 0;
-		wrmsrl(MSR_IA32_PQR_ASSOC, 0);
+		wrmsr(MSR_IA32_PQR_ASSOC, 0, state->clos);
 	} else {
 		WARN_ON_ONCE(!state->rmid);
 	}
@@ -1213,7 +1207,7 @@ static inline void cqm_pick_event_reader(int cpu)
 
 static void intel_cqm_cpu_prepare(unsigned int cpu)
 {
-	struct intel_cqm_state *state = &per_cpu(cqm_state, cpu);
+	struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
 	struct cpuinfo_x86 *c = &cpu_data(cpu);
 
 	raw_spin_lock_init(&state->lock);
-- 
1.9.1



* [PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
                   ` (4 preceding siblings ...)
  2015-03-12 23:16 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-12 23:16 ` [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide Vikas Shivappa
  6 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

CAT (Cache Allocation Technology) on HSW needs to be enumerated
separately. CAT is only supported on certain HSW SKUs.  This patch does
a probe test for HSW CPUs by writing a CLOSid into the high 32 bits of
IA32_PQR_MSR and checking whether the bits stick. The probe test is only
done after confirming that the CPU is HSW.
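
The "write a bit and see if it sticks" idea can be sketched in userspace
as follows. This is only a model under stated assumptions: struct
fake_msr, fake_msr_write(), fake_msr_read() and probe_closid_bit() are
hypothetical names, and unimplemented bits are modelled as silently
dropped, whereas the patch relies on wrmsr_safe()/rdmsr_safe() to also
cope with a faulting access.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a 64-bit MSR: only the bits set in
 * implemented_mask are retained on a write.
 */
struct fake_msr {
	uint64_t value;
	uint64_t implemented_mask;
};

static void fake_msr_write(struct fake_msr *m, uint64_t v)
{
	m->value = v & m->implemented_mask;
}

static uint64_t fake_msr_read(const struct fake_msr *m)
{
	return m->value;
}

/*
 * Flip the lowest CLOSid bit (bit 32), read the register back and
 * check whether the flipped value stuck. The old value is restored
 * before returning.
 */
static bool probe_closid_bit(struct fake_msr *m)
{
	uint64_t old = fake_msr_read(m);
	uint64_t tmp = old ^ (1ULL << 32);
	bool sticks;

	fake_msr_write(m, tmp);
	sticks = (fake_msr_read(m) == tmp);
	fake_msr_write(m, old);

	return sticks;
}
```

A model register whose mask covers the high 32 bits passes the probe,
while one that only implements the RMID bits does not.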

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 arch/x86/kernel/cpu/intel_rdt.c | 42 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
index 0330791..aa78711 100644
--- a/arch/x86/kernel/cpu/intel_rdt.c
+++ b/arch/x86/kernel/cpu/intel_rdt.c
@@ -38,11 +38,53 @@ DEFINE_PER_CPU(unsigned int, x86_cpu_clos);
 #define rdt_for_each_child(pos_css, parent_ir)		\
 	css_for_each_child((pos_css), &(parent_ir)->css)
 
+/*
+ * hsw_probetest() - Probe test for Intel Haswell CPUs.
+ * A probe is needed because Haswell does not have
+ * CPUID enumeration support for CAT.
+ *
+ * Probes by writing to the high 32 bits (CLOSid)
+ * of the IA32_PQR_MSR and testing if the bits stick.
+ * Then hardcodes the max CLOS and max bitmask length on HSW.
+ */
+
+static inline bool hsw_probetest(void)
+{
+	u32 l, h_old, h_new, h_tmp;
+
+	if (rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_old))
+		return false;
+
+	/*
+	 * Default value is always 0 if feature is present.
+	 */
+	h_tmp = h_old ^ 0x1U;
+	if (wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_tmp) ||
+	    rdmsr_safe(MSR_IA32_PQR_ASSOC, &l, &h_new))
+		return false;
+
+	if (h_tmp != h_new)
+		return false;
+
+	wrmsr_safe(MSR_IA32_PQR_ASSOC, l, h_old);
+
+	boot_cpu_data.x86_cat_closs = 4;
+	boot_cpu_data.x86_cat_cbmlength = 20;
+
+	return true;
+}
+
 static inline bool cat_supported(struct cpuinfo_x86 *c)
 {
 	if (cpu_has(c, X86_FEATURE_CAT_L3))
 		return true;
 
+	/*
+	 * Probe test for Haswell CPUs.
+	 */
+	if (c->x86 == 0x6 && c->x86_model == 0x3f)
+		return hsw_probetest();
+
 	return false;
 }
 
-- 
1.9.1



* [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
                   ` (5 preceding siblings ...)
  2015-03-12 23:16 ` [PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration Vikas Shivappa
@ 2015-03-12 23:16 ` Vikas Shivappa
  2015-03-25 22:39   ` Marcelo Tosatti
  6 siblings, 1 reply; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-12 23:16 UTC (permalink / raw)
  To: vikas.shivappa
  Cc: x86, linux-kernel, hpa, tglx, mingo, tj, peterz, matt.fleming,
	will.auld, glenn.p.williamson, kanaka.d.juvva, vikas.shivappa

This patch adds a description of Cache allocation technology, overview
of kernel implementation and usage of CAT cgroup interface.
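
The CBM rules that the documentation in this patch describes (a
contiguous run of bits, at least 2 bits, and a subset of the parent's
mask) can be sketched as a small check. This is a userspace illustration
only; cbm_is_contiguous() and cbm_is_valid() are hypothetical helpers,
not functions from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the set bits of cbm form one unbroken run. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	uint32_t low, norm;

	if (cbm == 0)
		return false;

	low = cbm & (~cbm + 1);		/* isolate the lowest set bit */
	norm = cbm / low;		/* shift the run down to bit 0 */

	/* A run starting at bit 0 has the form 2^n - 1. */
	return (norm & (norm + 1)) == 0;
}

/*
 * Checks the three documented rules: contiguous, at least 2 bits,
 * and no bit outside the parent's mask.
 */
static bool cbm_is_valid(uint32_t cbm, uint32_t parent_cbm)
{
	if (!cbm_is_contiguous(cbm))
		return false;
	if ((cbm & (cbm - 1)) == 0)	/* exactly one bit set */
		return false;
	return (cbm & ~parent_cbm) == 0;
}
```

With a parent mask of 0xff, the masks 0xf and 0xf0 pass, while 0xfff
(not a subset), 0x5 (not contiguous) and 0x1 (a single bit) are
rejected.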

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
---
 Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)
 create mode 100644 Documentation/cgroups/rdt.txt

diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
new file mode 100644
index 0000000..98eb4b8
--- /dev/null
+++ b/Documentation/cgroups/rdt.txt
@@ -0,0 +1,183 @@
+        RDT
+        ---
+
+Copyright (C) 2014 Intel Corporation
+Written by vikas.shivappa@linux.intel.com
+(based on contents and format from cpusets.txt)
+
+CONTENTS:
+=========
+
+1. Cache Allocation Technology
+  1.1 What is RDT and CAT ?
+  1.2 Why is CAT needed ?
+  1.3 CAT implementation overview
+  1.4 Assignment of CBM and CLOS
+  1.5 Scheduling and Context Switch
+2. Usage Examples and Syntax
+
+1. Cache Allocation Technology(CAT)
+===================================
+
+1.1 What is RDT and CAT
+-----------------------
+
+CAT is a part of Resource Director Technology (RDT), also called
+Platform Shared Resource Control, which provides support to control
+platform shared resources like cache. Currently cache is the only
+resource that is supported in RDT.
+More information can be found in the Intel SDM, Volume 3, section 17.15.
+
+Cache Allocation Technology provides a way for the software (OS/VMM)
+to restrict cache allocation to a defined 'subset' of cache which may
+overlap with other 'subsets'.  This feature is used when allocating a
+line in cache, i.e. when pulling new data into the cache.
+The h/w is programmed via MSRs.
+
+The different cache subsets are identified by CLOS identifier (class
+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
+contiguous set of bits which defines the amount of cache resource that
+is available for each 'subset'.
+
+1.2 Why is CAT needed
+---------------------
+
+CAT enables more cache resources to be made available for higher
+priority applications based on guidance from the execution
+environment.
+
+The architecture also allows dynamically changing these subsets during
+runtime to further optimize the performance of the higher priority
+application with minimal degradation to the low priority application.
+Additionally, resources can be rebalanced for system throughput
+benefit.  (Refer to Section 17.15 in the Intel SDM)
+
+This technique may be useful in managing large computer systems with a
+large LLC. Examples may be large servers running instances of
+webservers or database servers. In such complex systems, these subsets
+can be used for more careful placing of the available cache
+resources.
+
+The CAT kernel patch would provide a basic kernel framework for users
+to be able to implement such cache subsets.
+
+1.3 CAT implementation Overview
+-------------------------------
+
+Kernel implements a cgroup subsystem to support cache allocation.
+
+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
+A CLOS (class of service) is represented by a CLOSid. The CLOSid is
+internal to the kernel and not exposed to the user.  Each cgroup has one
+CBM and represents one cache 'subset'.
+
+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
+cgroup never fail.  When a child cgroup is created it inherits the
+CLOSid and the CBM from its parent.  When a user changes the default
+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
+used before.  Changing the 'cbm' may fail with -ENOSPC once the
+kernel runs out of the maximum number of CLOSids it can support.
+Users can create as many cgroups as they want, but the number of
+different CBMs in use at the same time is restricted by the maximum
+number of CLOSids (multiple cgroups can have the same CBM).
+The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
+count of the cgroups using each CLOSid.
+
+The tasks in the cgroup get to fill the part of the LLC represented by
+the cgroup's 'cbm' file.
+
+The root directory has all available bits set in its 'cbm' file by
+default.
+
+1.4 Assignment of CBM and CLOS
+------------------------------
+
+The 'cbm' needs to be a subset of the parent node's 'cbm'.
+Any contiguous subset of these bits (with a minimum of 2 bits) may be
+set to indicate the cache mapping desired.  The 'cbm' of 2
+directories can overlap. The 'cbm' represents the cache 'subset'
+of the CAT cgroup.  For example, on a system with a maximum of 16 cbm
+bits, if the directory has the least significant 4 bits set in its 'cbm'
+file (meaning the 'cbm' is just 0xf), it is allocated the right
+quarter of the last level cache, which means the tasks belonging to
+this CAT cgroup can use the right quarter of the cache to fill. If it
+has the most significant 8 bits set, it is allocated the left
+half of the cache (8 bits out of 16 represents 50%).
+
+The cache portion defined in the CBM file is available to all tasks
+within the cgroup to fill, and these tasks are not allowed to allocate
+space in other parts of the cache.
+
+1.5 Scheduling and Context Switch
+---------------------------------
+
+During a context switch the kernel writes the CLOSid (internally
+maintained by the kernel) of the cgroup to which the task belongs
+to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
+when there is a change in the CLOSid for the CPU in order to minimize
+the latency incurred during context switch.
+
+2. Usage examples and syntax
+============================
+
+To check if CAT was enabled on your system
+
+  dmesg | grep -i intel_rdt
+should output: intel_rdt: cbmlength:xx,Closs: xx
+The cbm length and the number of CLOS depend on the system you use.
+
+
+The following mounts the cache allocation cgroup subsystem and creates
+2 directories. Please refer to Documentation/cgroups/cgroups.txt for
+details about how to use cgroups.
+
+  cd /sys/fs/cgroup
+  mkdir rdt
+  mount -t cgroup -ordt rdt /sys/fs/cgroup/rdt
+  cd rdt
+
+Create 2 rdt cgroups
+
+  mkdir group1
+  mkdir group2
+
+The following are some of the files in the directory
+
+  ls
+  rdt.cbm
+  tasks
+
+Say the cache is 2MB and the cbm supports 16 bits; then the setting
+below allocates the 'right 1/4th (512KB)' of the cache to group2.
+
+Edit the CBM for group2 to set the least significant 4 bits.  This
+allocates the 'right quarter' of the cache.
+
+  cd group2
+  /bin/echo 0xf > rdt.cbm
+
+
+Edit the CBM for group2 to set the least significant 8 bits.  This
+allocates the right half of the cache to 'group2'.  The shell is
+already in the group2 directory, so no further cd is needed.
+
+  /bin/echo 0xff > rdt.cbm
+
+Assign tasks to the group2
+
+  /bin/echo PID1 > tasks
+  /bin/echo PID2 > tasks
+
+  This means that threads PID1 and PID2
+  now get to fill the 'right half' of
+  the cache, as they belong to cgroup group2.
+
+Create a group under group2 (the shell is still in group2)
+
+  mkdir group21
+  cd group21
+  cat rdt.cbm
+   0xff - inherits the parent's mask.
+
+  /bin/echo 0xfff > rdt.cbm - throws an error as the mask has to be a subset of the parent's mask
+
-- 
1.9.1



* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-12 23:16 ` [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide Vikas Shivappa
@ 2015-03-25 22:39   ` Marcelo Tosatti
  2015-03-26 18:38     ` Vikas Shivappa
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-03-25 22:39 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> This patch adds a description of Cache allocation technology, overview
> of kernel implementation and usage of CAT cgroup interface.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> ---
>  Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 183 insertions(+)
>  create mode 100644 Documentation/cgroups/rdt.txt
> 
> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> new file mode 100644
> index 0000000..98eb4b8
> --- /dev/null
> +++ b/Documentation/cgroups/rdt.txt
> @@ -0,0 +1,183 @@
> +        RDT
> +        ---
> +
> +Copyright (C) 2014 Intel Corporation
> +Written by vikas.shivappa@linux.intel.com
> +(based on contents and format from cpusets.txt)
> +
> +CONTENTS:
> +=========
> +
> +1. Cache Allocation Technology
> +  1.1 What is RDT and CAT ?
> +  1.2 Why is CAT needed ?
> +  1.3 CAT implementation overview
> +  1.4 Assignment of CBM and CLOS
> +  1.5 Scheduling and Context Switch
> +2. Usage Examples and Syntax
> +
> +1. Cache Allocation Technology(CAT)
> +===================================
> +
> +1.1 What is RDT and CAT
> +-----------------------
> +
> +CAT is a part of Resource Director Technology(RDT) or Platform Shared
> +resource control which provides support to control Platform shared
> +resources like cache. Currently Cache is the only resource that is
> +supported in RDT.
> +More information can be found in the Intel SDM section 17.15.
> +
> +Cache Allocation Technology provides a way for the Software (OS/VMM)
> +to restrict cache allocation to a defined 'subset' of cache which may
> +be overlapping with other 'subsets'.  This feature is used when
> +allocating a line in cache ie when pulling new data into the cache.
> +The programming of the h/w is done via programming  MSRs.
> +
> +The different cache subsets are identified by CLOS identifier (class
> +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> +contiguous set of bits which defines the amount of cache resource that
> +is available for each 'subset'.
> +
> +1.2 Why is CAT needed
> +---------------------
> +
> +The CAT  enables more cache resources to be made available for higher
> +priority applications based on guidance from the execution
> +environment.
> +
> +The architecture also allows dynamically changing these subsets during
> +runtime to further optimize the performance of the higher priority
> +application with minimal degradation to the low priority app.
> +Additionally, resources can be rebalanced for system throughput
> +benefit.  (Refer to Section 17.15 in the Intel SDM)
> +
> +This technique may be useful in managing large computer systems which
> +large LLC. Examples may be large servers running  instances of
> +webservers or database servers. In such complex systems, these subsets
> +can be used for more careful placing of the available cache
> +resources.
> +
> +The CAT kernel patch would provide a basic kernel framework for users
> +to be able to implement such cache subsets.
> +
> +1.3 CAT implementation Overview
> +-------------------------------
> +
> +Kernel implements a cgroup subsystem to support cache allocation.
> +
> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
> +to the kernel and not exposed to user.  Each cgroup would have one CBM
> +and would just represent one cache 'subset'.
> +
> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> +cgroup never fails.  When a child cgroup is created it inherits the
> +CLOSid and the CBM from its parent.  When a user changes the default
> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> +used before.  The changing of 'cbm' may fail with -ERRNOSPC once the
> +kernel runs out of maximum CLOSids it can support.
> +User can create as many cgroups as he wants but having different CBMs
> +at the same time is restricted by the maximum number of CLOSids
> +(multiple cgroups can have the same CBM).
> +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
> +for each cgroup using a CLOSid.
> +
> +The tasks in the cgroup would get to fill the LLC cache represented by
> +the cgroup's 'cbm' file.
> +
> +Root directory would have all available  bits set in 'cbm' file by
> +default.
> +
> +1.4 Assignment of CBM,CLOS
> +--------------------------
> +
> +The 'cbm' needs to be a  subset of the parent node's 'cbm'.
> +Any contiguous subset of these bits(with a minimum of 2 bits) maybe
> +set to indicate the cache mapping desired.  The 'cbm' between 2
> +directories can overlap. The 'cbm' would represent the cache 'subset'
> +of the CAT cgroup.  For ex: on a system with 16 bits of max cbm bits,
> +if the directory has the least significant 4 bits set in its 'cbm'
> +file(meaning the 'cbm' is just 0xf), it would be allocated the right
> +quarter of the Last level cache which means the tasks belonging to
> +this CAT cgroup can use the right quarter of the cache to fill. If it
> +has the most significant 8 bits set ,it would be allocated the left
> +half of the cache(8 bits  out of 16 represents 50%).
> +
> +The cache portion defined in the CBM file is available to all tasks
> +within the cgroup to fill and these task are not allowed to allocate
> +space in other parts of the cache.

Is there a reason to expose the hardware interface rather
than ratios to userspace?

Say, I'd like to allocate 20% of L3 cache to cgroup A,
80% to cgroup B.

Well, you'd have to expose the shared percentages between
any two cgroups (that information is there in the
cbm bitmaps, but not in "ratios").

One problem I see with exposing cbm bitmasks is that on hardware
updates that change cache size or bitmask length, userspace must
recalculate the bitmaps.

Another is that it's vendor dependent, while ratios (plus shared
information for two given cgroups) are not.
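
To make the recalculation concern concrete, here is a minimal sketch of
how userspace might turn a percentage into a contiguous bitmask for a
given bitmask length. cbm_from_percent() is a hypothetical helper, not
part of the series; the rounding policy and the caller-chosen shift are
assumptions of the sketch, and the 2-bit minimum comes from the
documentation quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a contiguous mask covering roughly 'percent' of a cache whose
 * bitmask has cbm_len bits, starting at bit 'shift'. The bit count is
 * rounded to the nearest bit and clamped to a minimum of 2 bits.
 */
static uint32_t cbm_from_percent(unsigned int percent,
				 unsigned int cbm_len,
				 unsigned int shift)
{
	unsigned int nbits;

	assert(percent <= 100);
	assert(cbm_len >= 2 && cbm_len <= 32);

	nbits = (percent * cbm_len + 50) / 100;
	if (nbits < 2)
		nbits = 2;

	assert(shift + nbits <= cbm_len);

	return (uint32_t)(((1ULL << nbits) - 1) << shift);
}
```

With a 20-bit mask, 20% and 80% map to 0xf and 0xffff0. With a 16-bit
mask the same ratios give 0x7 and 0xfff8 (3 and 13 bits), so the masks
have to be recomputed and the split is no longer exact.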


> +
> +1.5 Scheduling and Context Switch
> +---------------------------------
> +
> +During context switch kernel implements this by writing the
> +CLOSid (internally maintained by kernel) of the cgroup to which the
> +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
> +when there is a change in the CLOSid for the CPU in order to minimize
> +the latency incurred during context switch.
> +
> +2. Usage examples and syntax
> +============================
> +
> +To check if CAT was enabled on your system
> +
> +dmesg | grep -i intel_rdt
> +should output : intel_rdt: cbmlength:xx, Closs:xx
> +the length of cbm and CLOS should depend on the system you use.
> +
> +
> +Following would mount the cache allocation cgroup subsystem and create
> +2 directories. Please refer to Documentation/cgroups/cgroups.txt on
> +details about how to use cgroups.
> +
> +  cd /sys/fs/cgroup
> +  mkdir rdt
> +  mount -t cgroup -ordt rdt /sys/fs/cgroup/rdt
> +  cd rdt
> +
> +Create 2 rdt cgroups
> +
> +  mkdir group1
> +  mkdir group2
> +
> +Following are some of the Files in the directory
> +
> +  ls
> +  rdt.cbm
> +  tasks
> +
> +Say if the cache is 2MB and cbm supports 16 bits, then setting the
> +below allocates the 'right 1/4th(512KB)' of the cache to group2
> +
> +Edit the CBM for group2 to set the least significant 4 bits.  This
> +allocates 'right quarter' of the cache.
> +
> +  cd group2
> +  /bin/echo 0xf > cat.cbm
> +
> +
> +Edit the CBM for group2 to set the least significant 8 bits.This
> +allocates the right half of the cache to 'group2'.
> +
> +  cd group2
> +  /bin/echo 0xff > rdt.cbm
> +
> +Assign tasks to the group2
> +
> +  /bin/echo PID1 > tasks
> +  /bin/echo PID2 > tasks
> +
> +  Meaning now threads
> +  PID1 and PID2 get to fill the 'right half' of
> +  the cache as the belong to cgroup group2.
> +
> +Create a group under group2
> +
> +  cd group2
> +  mkdir group21
> +  cat rdt.cbm
> +   0xff - inherits parents mask.
> +
> +  /bin/echo 0xfff > rdt.cbm - throws error as mask has to parent's mask's subset
> +
> -- 
> 1.9.1
> 


* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-25 22:39   ` Marcelo Tosatti
@ 2015-03-26 18:38     ` Vikas Shivappa
  2015-03-27  1:29       ` Marcelo Tosatti
  0 siblings, 1 reply; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-26 18:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson,
	kanaka.d.juvva


Hello Marcelo,

On Wed, 25 Mar 2015, Marcelo Tosatti wrote:

> On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
>> This patch adds a description of Cache allocation technology, overview
>> of kernel implementation and usage of CAT cgroup interface.
>>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>> ---
>>  Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 183 insertions(+)
>>  create mode 100644 Documentation/cgroups/rdt.txt
>>
>> diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
>> new file mode 100644
>> index 0000000..98eb4b8
>> --- /dev/null
>> +++ b/Documentation/cgroups/rdt.txt
>> @@ -0,0 +1,183 @@
>> +        RDT
>> +        ---
>> +
>> +Copyright (C) 2014 Intel Corporation
>> +Written by vikas.shivappa@linux.intel.com
>> +(based on contents and format from cpusets.txt)
>> +
>> +CONTENTS:
>> +=========
>> +
>> +1. Cache Allocation Technology
>> +  1.1 What is RDT and CAT ?
>> +  1.2 Why is CAT needed ?
>> +  1.3 CAT implementation overview
>> +  1.4 Assignment of CBM and CLOS
>> +  1.5 Scheduling and Context Switch
>> +2. Usage Examples and Syntax
>> +
>> +1. Cache Allocation Technology(CAT)
>> +===================================
>> +
>> +1.1 What is RDT and CAT
>> +-----------------------
>> +
>> +CAT is a part of Resource Director Technology(RDT) or Platform Shared
>> +resource control which provides support to control Platform shared
>> +resources like cache. Currently Cache is the only resource that is
>> +supported in RDT.
>> +More information can be found in the Intel SDM section 17.15.
>> +
>> +Cache Allocation Technology provides a way for the Software (OS/VMM)
>> +to restrict cache allocation to a defined 'subset' of cache which may
>> +be overlapping with other 'subsets'.  This feature is used when
>> +allocating a line in cache ie when pulling new data into the cache.
>> +The programming of the h/w is done via programming  MSRs.
>> +
>> +The different cache subsets are identified by CLOS identifier (class
>> +of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
>> +contiguous set of bits which defines the amount of cache resource that
>> +is available for each 'subset'.
>> +
>> +1.2 Why is CAT needed
>> +---------------------
>> +
>> +The CAT  enables more cache resources to be made available for higher
>> +priority applications based on guidance from the execution
>> +environment.
>> +
>> +The architecture also allows dynamically changing these subsets during
>> +runtime to further optimize the performance of the higher priority
>> +application with minimal degradation to the low priority app.
>> +Additionally, resources can be rebalanced for system throughput
>> +benefit.  (Refer to Section 17.15 in the Intel SDM)
>> +
>> +This technique may be useful in managing large computer systems which
>> +large LLC. Examples may be large servers running  instances of
>> +webservers or database servers. In such complex systems, these subsets
>> +can be used for more careful placing of the available cache
>> +resources.
>> +
>> +The CAT kernel patch would provide a basic kernel framework for users
>> +to be able to implement such cache subsets.
>> +
>> +1.3 CAT implementation Overview
>> +-------------------------------
>> +
>> +Kernel implements a cgroup subsystem to support cache allocation.
>> +
>> +Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
>> +A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
>> +to the kernel and not exposed to user.  Each cgroup would have one CBM
>> +and would just represent one cache 'subset'.
>> +
>> +The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
>> +cgroup never fails.  When a child cgroup is created it inherits the
>> +CLOSid and the CBM from its parent.  When a user changes the default
>> +CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
>> +used before.  The changing of 'cbm' may fail with -ERRNOSPC once the
>> +kernel runs out of maximum CLOSids it can support.
>> +User can create as many cgroups as he wants but having different CBMs
>> +at the same time is restricted by the maximum number of CLOSids
>> +(multiple cgroups can have the same CBM).
>> +Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
>> +for each cgroup using a CLOSid.
>> +
>> +The tasks in the cgroup would get to fill the LLC cache represented by
>> +the cgroup's 'cbm' file.
>> +
>> +Root directory would have all available  bits set in 'cbm' file by
>> +default.
>> +
>> +1.4 Assignment of CBM,CLOS
>> +--------------------------
>> +
>> +The 'cbm' needs to be a  subset of the parent node's 'cbm'.
>> +Any contiguous subset of these bits(with a minimum of 2 bits) maybe
>> +set to indicate the cache mapping desired.  The 'cbm' between 2
>> +directories can overlap. The 'cbm' would represent the cache 'subset'
>> +of the CAT cgroup.  For ex: on a system with 16 bits of max cbm bits,
>> +if the directory has the least significant 4 bits set in its 'cbm'
>> +file(meaning the 'cbm' is just 0xf), it would be allocated the right
>> +quarter of the Last level cache which means the tasks belonging to
>> +this CAT cgroup can use the right quarter of the cache to fill. If it
>> +has the most significant 8 bits set ,it would be allocated the left
>> +half of the cache(8 bits  out of 16 represents 50%).
>> +
>> +The cache portion defined in the CBM file is available to all tasks
>> +within the cgroup to fill and these task are not allowed to allocate
>> +space in other parts of the cache.
>
> Is there a reason to expose the hardware interface rather
> than ratios to userspace ?
>
> Say, i'd like to allocate 20% of L3 cache to cgroup A,
> 80% to cgroup B.
>
> Well, you'd have to expose the shared percentages between
> any two cgroups (that information is there in the
> cbm bitmaps, but not in "ratios").
>
> One problem i see with exposing cbm bitmasks is that on hardware
> updates that change cache size or bitmask length, userspace must
> recalculate the bitmaps.
>
> Another is that its vendor dependant, while ratios (plus shared
> information for two given cgroups) is not.
>

Agreed, this interface does not give an option to allocate directly in
terms of percentage. But note that specifying bitmasks allows the user to
allocate overlapping cache areas, and since we use cgroups we naturally
follow the cgroup hierarchy. The user should be able to convert the
bitmasks into the intended percentage or size values based on the other
available cache size info in hooks like cpuinfo.
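
A minimal sketch of that conversion, assuming the user already knows the
bitmask length and the cache size. cbm_weight(), cbm_percent(),
cbm_bytes() and cbm_shared_bits() are hypothetical userspace helpers,
not an interface provided by these patches.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits set in the mask. */
static unsigned int cbm_weight(uint32_t cbm)
{
	unsigned int n = 0;

	for (; cbm; cbm &= cbm - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* Share of the cache covered by the mask, in whole percent. */
static unsigned int cbm_percent(uint32_t cbm, unsigned int cbm_len)
{
	assert(cbm_len >= 1 && cbm_len <= 32);
	return cbm_weight(cbm) * 100 / cbm_len;
}

/* Size in bytes of the cache portion covered by the mask. */
static uint64_t cbm_bytes(uint32_t cbm, unsigned int cbm_len,
			  uint64_t cache_bytes)
{
	assert(cbm_len >= 1 && cbm_len <= 32);
	return cache_bytes * cbm_weight(cbm) / cbm_len;
}

/* Number of mask bits that two cgroups share. */
static unsigned int cbm_shared_bits(uint32_t a, uint32_t b)
{
	return cbm_weight(a & b);
}
```

For the 2MB, 16-bit example from the documentation, 0xf works out to 25%
or 512KB and 0xff to 50%, and the two masks share 4 bits.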

We discussed this in more detail on the older patches; here is one thread
where we discussed it, for your reference:
http://marc.info/?l=linux-kernel&m=142482002022543&w=2

Thanks,
Vikas

>
>> +
>> +1.5 Scheduling and Context Switch
>> +---------------------------------
>> +
>> +During context switch kernel implements this by writing the
>> +CLOSid (internally maintained by kernel) of the cgroup to which the
>> +task belongs to the CPU's IA32_PQR_ASSOC MSR. The MSR is only written
>> +when there is a change in the CLOSid for the CPU in order to minimize
>> +the latency incurred during context switch.
>> +
>> +2. Usage examples and syntax
>> +============================
>> +
>> +To check if CAT was enabled on your system
>> +
>> +dmesg | grep -i intel_rdt
>> +should output : intel_rdt: cbmlength:xx, Closs:xx
>> +the length of cbm and CLOS should depend on the system you use.
>> +
>> +
>> +Following would mount the cache allocation cgroup subsystem and create
>> +2 directories. Please refer to Documentation/cgroups/cgroups.txt on
>> +details about how to use cgroups.
>> +
>> +  cd /sys/fs/cgroup
>> +  mkdir rdt
>> +  mount -t cgroup -ordt rdt /sys/fs/cgroup/rdt
>> +  cd rdt
>> +
>> +Create 2 rdt cgroups
>> +
>> +  mkdir group1
>> +  mkdir group2
>> +
>> +Following are some of the Files in the directory
>> +
>> +  ls
>> +  rdt.cbm
>> +  tasks
>> +
>> +Say if the cache is 2MB and cbm supports 16 bits, then setting the
>> +below allocates the 'right 1/4th(512KB)' of the cache to group2
>> +
>> +Edit the CBM for group2 to set the least significant 4 bits.  This
>> +allocates 'right quarter' of the cache.
>> +
>> +  cd group2
>> +  /bin/echo 0xf > cat.cbm
>> +
>> +
>> +Edit the CBM for group2 to set the least significant 8 bits.This
>> +allocates the right half of the cache to 'group2'.
>> +
>> +  cd group2
>> +  /bin/echo 0xff > rdt.cbm
>> +
>> +Assign tasks to the group2
>> +
>> +  /bin/echo PID1 > tasks
>> +  /bin/echo PID2 > tasks
>> +
>> +  Meaning now threads
>> +  PID1 and PID2 get to fill the 'right half' of
>> +  the cache as the belong to cgroup group2.
>> +
>> +Create a group under group2
>> +
>> +  cd group2
>> +  mkdir group21
>> +  cat rdt.cbm
>> +   0xff - inherits parents mask.
>> +
>> +  /bin/echo 0xfff > rdt.cbm - throws error as mask has to parent's mask's subset
>> +
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>

* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-26 18:38     ` Vikas Shivappa
@ 2015-03-27  1:29       ` Marcelo Tosatti
  2015-03-31  1:17         ` Marcelo Tosatti
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-03-27  1:29 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> 
> Hello Marcelo,

Hi Vikas,

> On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> 
> >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> >>This patch adds a description of Cache allocation technology, overview
> >>of kernel implementation and usage of CAT cgroup interface.
> >>
> >>Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> >>---
> >> Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 183 insertions(+)
> >> create mode 100644 Documentation/cgroups/rdt.txt
> >>
> >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> >>new file mode 100644
> >>index 0000000..98eb4b8
> >>--- /dev/null
> >>+++ b/Documentation/cgroups/rdt.txt
> >>@@ -0,0 +1,183 @@
> >>+        RDT
> >>+        ---
> >>+
> >>+Copyright (C) 2014 Intel Corporation
> >>+Written by vikas.shivappa@linux.intel.com
> >>+(based on contents and format from cpusets.txt)
> >>+
> >>+CONTENTS:
> >>+=========
> >>+
> >>+1. Cache Allocation Technology
> >>+  1.1 What is RDT and CAT ?
> >>+  1.2 Why is CAT needed ?
> >>+  1.3 CAT implementation overview
> >>+  1.4 Assignment of CBM and CLOS
> >>+  1.5 Scheduling and Context Switch
> >>+2. Usage Examples and Syntax
> >>+
> >>+1. Cache Allocation Technology(CAT)
> >>+===================================
> >>+
> >>+1.1 What is RDT and CAT
> >>+-----------------------
> >>+
> >>+CAT is a part of Resource Director Technology(RDT) or Platform Shared
> >>+resource control which provides support to control Platform shared
> >>+resources like cache. Currently Cache is the only resource that is
> >>+supported in RDT.
> >>+More information can be found in the Intel SDM section 17.15.
> >>+
> >>+Cache Allocation Technology provides a way for the Software (OS/VMM)
> >>+to restrict cache allocation to a defined 'subset' of cache which may
> >>+be overlapping with other 'subsets'.  This feature is used when
> >>+allocating a line in cache ie when pulling new data into the cache.
> >>+The programming of the h/w is done via programming  MSRs.
> >>+
> >>+The different cache subsets are identified by CLOS identifier (class
> >>+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> >>+contiguous set of bits which defines the amount of cache resource that
> >>+is available for each 'subset'.
> >>+
> >>+1.2 Why is CAT needed
> >>+---------------------
> >>+
> >>+The CAT  enables more cache resources to be made available for higher
> >>+priority applications based on guidance from the execution
> >>+environment.
> >>+
> >>+The architecture also allows dynamically changing these subsets during
> >>+runtime to further optimize the performance of the higher priority
> >>+application with minimal degradation to the low priority app.
> >>+Additionally, resources can be rebalanced for system throughput
> >>+benefit.  (Refer to Section 17.15 in the Intel SDM)
> >>+
> >>+This technique may be useful in managing large computer systems which
> >>+large LLC. Examples may be large servers running  instances of
> >>+webservers or database servers. In such complex systems, these subsets
> >>+can be used for more careful placing of the available cache
> >>+resources.
> >>+
> >>+The CAT kernel patch would provide a basic kernel framework for users
> >>+to be able to implement such cache subsets.
> >>+
> >>+1.3 CAT implementation Overview
> >>+-------------------------------
> >>+
> >>+Kernel implements a cgroup subsystem to support cache allocation.
> >>+
> >>+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
> >>+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
> >>+to the kernel and not exposed to user.  Each cgroup would have one CBM
> >>+and would just represent one cache 'subset'.
> >>+
> >>+The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> >>+cgroup never fails.  When a child cgroup is created it inherits the
> >>+CLOSid and the CBM from its parent.  When a user changes the default
> >>+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> >>+used before.  The changing of 'cbm' may fail with -ERRNOSPC once the
> >>+kernel runs out of maximum CLOSids it can support.
> >>+User can create as many cgroups as he wants but having different CBMs
> >>+at the same time is restricted by the maximum number of CLOSids
> >>+(multiple cgroups can have the same CBM).
> >>+Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
> >>+for each cgroup using a CLOSid.
> >>+
> >>+The tasks in the cgroup would get to fill the LLC cache represented by
> >>+the cgroup's 'cbm' file.
> >>+
> >>+Root directory would have all available  bits set in 'cbm' file by
> >>+default.
> >>+
> >>+1.4 Assignment of CBM,CLOS
> >>+--------------------------
> >>+
> >>+The 'cbm' needs to be a  subset of the parent node's 'cbm'.
> >>+Any contiguous subset of these bits(with a minimum of 2 bits) maybe
> >>+set to indicate the cache mapping desired.  The 'cbm' between 2
> >>+directories can overlap. The 'cbm' would represent the cache 'subset'
> >>+of the CAT cgroup.  For ex: on a system with 16 bits of max cbm bits,
> >>+if the directory has the least significant 4 bits set in its 'cbm'
> >>+file(meaning the 'cbm' is just 0xf), it would be allocated the right
> >>+quarter of the Last level cache which means the tasks belonging to
> >>+this CAT cgroup can use the right quarter of the cache to fill. If it
> >>+has the most significant 8 bits set ,it would be allocated the left
> >>+half of the cache(8 bits  out of 16 represents 50%).
> >>+
> >>+The cache portion defined in the CBM file is available to all tasks
> >>+within the cgroup to fill and these task are not allowed to allocate
> >>+space in other parts of the cache.
> >
> >Is there a reason to expose the hardware interface rather
> >than ratios to userspace ?
> >
> >Say, i'd like to allocate 20% of L3 cache to cgroup A,
> >80% to cgroup B.
> >
> >Well, you'd have to expose the shared percentages between
> >any two cgroups (that information is there in the
> >cbm bitmaps, but not in "ratios").
> >
> >One problem i see with exposing cbm bitmasks is that on hardware
> >updates that change cache size or bitmask length, userspace must
> >recalculate the bitmaps.
> >
> >Another is that its vendor dependant, while ratios (plus shared
> >information for two given cgroups) is not.
> >
> 
> Agree that this interface doesnot give options to directly allocate
> in terms of percentage . But note that specifying in bitmasks allows
> the user to allocate overlapping cache areas and also since we use
> cgroup we naturally follow the cgroup hierarchy. User should be able
> to convert the bitmasks into intended percentage or size values
> based on the other available cache size info in hooks like cpuinfo.
> 
> We discussed more on this before in the older patches and here is
> one thread where we discussed it for your reference -
> http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> 
> Thanks,
> Vikas

I can't find any discussion relating to exposing the CBM interface
directly to userspace in that thread ?

cpu.shares is written in ratio form, which is much more natural.
Do you see any advantage in maintaining the (ratio -> cbm bitmasks)
translation in userspace rather than in the kernel?

What about something like:


		      root cgroup
		   /		  \
		  /		    \
		/		      \
	cgroupA-80			cgroupB-30


So that whatever exceeds 100% is the ratio of cache 
shared at that level (cgroup A and B share 10% of cache 
at that level).
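
One way to picture the ratio proposal above is the following sketch. It assumes a 20-bit mask and a placement policy where cgroupA is packed at the low end and cgroupB at the high end, so that whatever exceeds 100% ends up as the overlap. The function name and the policy are illustrative assumptions, not something the patchset implements.

```python
def ratios_to_cbms(pct_a, pct_b, cbm_len):
    """Translate two percentages into contiguous masks.

    Group A is placed at the low end and group B at the high end, so any
    excess over 100% becomes the shared (overlapping) region.
    """
    bits_a = round(cbm_len * pct_a / 100)
    bits_b = round(cbm_len * pct_b / 100)
    if not (0 < bits_a <= cbm_len and 0 < bits_b <= cbm_len):
        raise ValueError("each group needs between 1 and cbm_len bits")
    cbm_a = (1 << bits_a) - 1
    cbm_b = ((1 << bits_b) - 1) << (cbm_len - bits_b)
    return cbm_a, cbm_b, cbm_a & cbm_b


cbm_a, cbm_b, shared = ratios_to_cbms(80, 30, 20)
print(hex(cbm_a), hex(cbm_b), hex(shared))  # 0xffff 0xfc000 0xc000
```

With 20 bits the shared region is 2 bits, which is the 10% from the example. Other mask lengths would round differently.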

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html

cpu — the cpu.shares parameter determines the share of CPU resources
available to each process in all cgroups. Setting the parameter to 250,
250, and 500 in the finance, sales, and engineering cgroups respectively
means that processes started in these groups will split the resources
with a 1:1:2 ratio. Note that when a single process is running, it
consumes as much CPU as necessary no matter which cgroup it is placed
in. The CPU limitation only comes into effect when two or more processes
compete for CPU resources. 


* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-27  1:29       ` Marcelo Tosatti
@ 2015-03-31  1:17         ` Marcelo Tosatti
  2015-03-31 17:27         ` Vikas Shivappa
  2015-03-31 17:32         ` Vikas Shivappa
  2 siblings, 0 replies; 20+ messages in thread
From: Marcelo Tosatti @ 2015-03-31  1:17 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Thu, Mar 26, 2015 at 10:29:27PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> > 
> > Hello Marcelo,
> 
> Hi Vikas,
> 
> > On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> > 
> > >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> > >>This patch adds a description of Cache allocation technology, overview
> > >>of kernel implementation and usage of CAT cgroup interface.
> > >>
> > >>Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> > >>---
> > >> Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
> > >> 1 file changed, 183 insertions(+)
> > >> create mode 100644 Documentation/cgroups/rdt.txt
> > >>
> > >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> > >>new file mode 100644
> > >>index 0000000..98eb4b8
> > >>--- /dev/null
> > >>+++ b/Documentation/cgroups/rdt.txt
> > >>@@ -0,0 +1,183 @@
> > >>+        RDT
> > >>+        ---
> > >>+
> > >>+Copyright (C) 2014 Intel Corporation
> > >>+Written by vikas.shivappa@linux.intel.com
> > >>+(based on contents and format from cpusets.txt)
> > >>+
> > >>+CONTENTS:
> > >>+=========
> > >>+
> > >>+1. Cache Allocation Technology
> > >>+  1.1 What is RDT and CAT ?
> > >>+  1.2 Why is CAT needed ?
> > >>+  1.3 CAT implementation overview
> > >>+  1.4 Assignment of CBM and CLOS
> > >>+  1.5 Scheduling and Context Switch
> > >>+2. Usage Examples and Syntax
> > >>+
> > >>+1. Cache Allocation Technology(CAT)
> > >>+===================================
> > >>+
> > >>+1.1 What is RDT and CAT
> > >>+-----------------------
> > >>+
> > >>+CAT is a part of Resource Director Technology(RDT) or Platform Shared
> > >>+resource control which provides support to control Platform shared
> > >>+resources like cache. Currently Cache is the only resource that is
> > >>+supported in RDT.
> > >>+More information can be found in the Intel SDM section 17.15.
> > >>+
> > >>+Cache Allocation Technology provides a way for the Software (OS/VMM)
> > >>+to restrict cache allocation to a defined 'subset' of cache which may
> > >>+be overlapping with other 'subsets'.  This feature is used when
> > >>+allocating a line in cache ie when pulling new data into the cache.
> > >>+The programming of the h/w is done via programming  MSRs.
> > >>+
> > >>+The different cache subsets are identified by CLOS identifier (class
> > >>+of service) and each CLOS has a CBM (cache bit mask).  The CBM is a
> > >>+contiguous set of bits which defines the amount of cache resource that
> > >>+is available for each 'subset'.
> > >>+
> > >>+1.2 Why is CAT needed
> > >>+---------------------
> > >>+
> > >>+The CAT  enables more cache resources to be made available for higher
> > >>+priority applications based on guidance from the execution
> > >>+environment.
> > >>+
> > >>+The architecture also allows dynamically changing these subsets during
> > >>+runtime to further optimize the performance of the higher priority
> > >>+application with minimal degradation to the low priority app.
> > >>+Additionally, resources can be rebalanced for system throughput
> > >>+benefit.  (Refer to Section 17.15 in the Intel SDM)
> > >>+
> > >>+This technique may be useful in managing large computer systems which
> > >>+large LLC. Examples may be large servers running  instances of
> > >>+webservers or database servers. In such complex systems, these subsets
> > >>+can be used for more careful placing of the available cache
> > >>+resources.
> > >>+
> > >>+The CAT kernel patch would provide a basic kernel framework for users
> > >>+to be able to implement such cache subsets.
> > >>+
> > >>+1.3 CAT implementation Overview
> > >>+-------------------------------
> > >>+
> > >>+Kernel implements a cgroup subsystem to support cache allocation.
> > >>+
> > >>+Each cgroup has a CLOSid <-> CBM(cache bit mask) mapping.
> > >>+A CLOS(Class of service) is represented by a CLOSid.CLOSid is internal
> > >>+to the kernel and not exposed to user.  Each cgroup would have one CBM
> > >>+and would just represent one cache 'subset'.
> > >>+
> > >>+The cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> > >>+cgroup never fails.  When a child cgroup is created it inherits the
> > >>+CLOSid and the CBM from its parent.  When a user changes the default
> > >>+CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> > >>+used before.  The changing of 'cbm' may fail with -ERRNOSPC once the
> > >>+kernel runs out of maximum CLOSids it can support.
> > >>+User can create as many cgroups as he wants but having different CBMs
> > >>+at the same time is restricted by the maximum number of CLOSids
> > >>+(multiple cgroups can have the same CBM).
> > >>+Kernel maintains a CLOSid<->cbm mapping which keeps reference counter
> > >>+for each cgroup using a CLOSid.
> > >>+
> > >>+The tasks in the cgroup would get to fill the LLC cache represented by
> > >>+the cgroup's 'cbm' file.
> > >>+
> > >>+Root directory would have all available  bits set in 'cbm' file by
> > >>+default.
> > >>+
> > >>+1.4 Assignment of CBM,CLOS
> > >>+--------------------------
> > >>+
> > >>+The 'cbm' needs to be a  subset of the parent node's 'cbm'.
> > >>+Any contiguous subset of these bits(with a minimum of 2 bits) maybe
> > >>+set to indicate the cache mapping desired.  The 'cbm' between 2
> > >>+directories can overlap. The 'cbm' would represent the cache 'subset'
> > >>+of the CAT cgroup.  For ex: on a system with 16 bits of max cbm bits,
> > >>+if the directory has the least significant 4 bits set in its 'cbm'
> > >>+file(meaning the 'cbm' is just 0xf), it would be allocated the right
> > >>+quarter of the Last level cache which means the tasks belonging to
> > >>+this CAT cgroup can use the right quarter of the cache to fill. If it
> > >>+has the most significant 8 bits set ,it would be allocated the left
> > >>+half of the cache(8 bits  out of 16 represents 50%).
> > >>+
> > >>+The cache portion defined in the CBM file is available to all tasks
> > >>+within the cgroup to fill and these task are not allowed to allocate
> > >>+space in other parts of the cache.
> > >
> > >Is there a reason to expose the hardware interface rather
> > >than ratios to userspace ?
> > >
> > >Say, i'd like to allocate 20% of L3 cache to cgroup A,
> > >80% to cgroup B.
> > >
> > >Well, you'd have to expose the shared percentages between
> > >any two cgroups (that information is there in the
> > >cbm bitmaps, but not in "ratios").
> > >
> > >One problem i see with exposing cbm bitmasks is that on hardware
> > >updates that change cache size or bitmask length, userspace must
> > >recalculate the bitmaps.
> > >
> > >Another is that its vendor dependant, while ratios (plus shared
> > >information for two given cgroups) is not.
> > >
> > 
> > Agree that this interface doesnot give options to directly allocate
> > in terms of percentage . But note that specifying in bitmasks allows
> > the user to allocate overlapping cache areas and also since we use
> > cgroup we naturally follow the cgroup hierarchy. User should be able
> > to convert the bitmasks into intended percentage or size values
> > based on the other available cache size info in hooks like cpuinfo.
> > 
> > We discussed more on this before in the older patches and here is
> > one thread where we discussed it for your reference -
> > http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> > 
> > Thanks,
> > Vikas
> 
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread ?
> 
> Cpu.shares is written in ratio form, which is much more natural.
> Do you see any advantage in maintaining the 
> 
> (ratio -> cbm bitmasks) 
> 
> translation in userspace rather than in the kernel ? 
> 
> What about something like:
> 
> 
> 		      root cgroup
> 		   /		  \
> 		  /		    \
> 		/		      \
> 	cgroupA-80			cgroupB-30
> 
> 
> So that whatever exceeds 100% is the ratio of cache 
> shared at that level (cgroup A and B share 10% of cache 
> at that level).
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> 
> cpu — the cpu.shares parameter determines the share of CPU resources
> available to each process in all cgroups. Setting the parameter to 250,
> 250, and 500 in the finance, sales, and engineering cgroups respectively
> means that processes started in these groups will split the resources
> with a 1:1:2 ratio. Note that when a single process is running, it
> consumes as much CPU as necessary no matter which cgroup it is placed
> in. The CPU limitation only comes into effect when two or more processes
> compete for CPU resources. 

Vikas,

I see the following resource specifications from the POV of a user/admin:

1) Ratios. 

X%/Y%, as discussed above.

2) Specific kilobyte values.

In accord with the rest of cgroups, allow specific kilobyte
specification. See limit_in_bytes, for example, from

https://www.kernel.org/doc/Documentation/cgroups/memory.txt

Of course you would have to convert to way units, but i see
two use-cases here:

	- User wants application to not reclaim more than a
	  given number of kilobytes of LLC.
	- User wants application to be guaranteed a given
	  number of kilobytes of LLC, even across processor changes.

Again, some precision is lost with LLC.

3) Per-CPU differentiation 

The current patchset deals with the following use-case suboptimally:


	CPU1-4				CPU5-8

	die1				die2



* Task groupA is isolated to CPU-8 (die2).
* Task groupA has 50% of the cache reserved.
* Task groupB can reclaim into the remaining 50% of the cache of die2.
* Task groupB can reclaim into 100% of the cache of die1.

I suppose this is a common scenario which is not handled by
the current patchset (you would have task groupB use only 50%
of the cache of die1).
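
The difference can be sketched with plain mask arithmetic. The 16-bit mask length, the concrete mask values and the per-socket table below are all assumptions made up for illustration; the current patchset only has the global form.

```python
CBM_LEN = 16                      # assumed mask length
FULL = (1 << CBM_LEN) - 1


def pct(mask):
    """Percentage of the cache covered by a mask, assuming equal slices per bit."""
    return 100.0 * bin(mask & FULL).count("1") / CBM_LEN


# Global masks: one mask per group, applied on every package.
global_masks = {"groupA": 0xFF00, "groupB": 0x00FF}

# Hypothetical per-socket masks: groupA only runs on die2, so groupB can be
# given the whole cache of die1 while staying out of groupA's half on die2.
per_socket_masks = {
    "groupA": {"die2": 0xFF00},
    "groupB": {"die1": FULL, "die2": 0x00FF},
}

print(pct(global_masks["groupB"]))              # 50.0 on die1 with global masks
print(pct(per_socket_masks["groupB"]["die1"]))  # 100.0 on die1 with per-socket masks
```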


* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-27  1:29       ` Marcelo Tosatti
  2015-03-31  1:17         ` Marcelo Tosatti
@ 2015-03-31 17:27         ` Vikas Shivappa
  2015-03-31 22:56           ` Marcelo Tosatti
  2015-07-28 23:37           ` Marcelo Tosatti
  2015-03-31 17:32         ` Vikas Shivappa
  2 siblings, 2 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-31 17:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson,
	kanaka.d.juvva



On Thu, 26 Mar 2015, Marcelo Tosatti wrote:

>
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread ?
>
> Cpu.shares is written in ratio form, which is much more natural.
> Do you see any advantage in maintaining the
>
> (ratio -> cbm bitmasks)
>
> translation in userspace rather than in the kernel ?
>
> What about something like:
>
>
> 		      root cgroup
> 		   /		  \
> 		  /		    \
> 		/		      \
> 	cgroupA-80			cgroupB-30
>
>
> So that whatever exceeds 100% is the ratio of cache
> shared at that level (cgroup A and B share 10% of cache
> at that level).

But this also means the 2 groups share all of the cache?

Specifying the bits to be shared lets you specify the exact cache area you want
to share, and it also works when your total occupancy does not cover all of the
cache. For example, it gets more complex when you want to share only the left
quarter of the cache: cgroupA gets the left half and cgroupB gets the left
quarter. The bitmask aligns with how the h/w is designed to share the cache,
which gives you the flexibility to define any specific overlapping areas of the
cache.
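
As a small illustration of the left-half/left-quarter case, the sketch below assumes a 16-bit cbm with the "left" side being the most significant bits, as in the documentation. The helper name is made up for this example.

```python
def describe_overlap(cbm_a, cbm_b, cbm_len):
    """Report the shared region of two masks and how much cache they cover."""
    def popcount(mask):
        return bin(mask).count("1")

    shared = cbm_a & cbm_b
    return {
        "shared_mask": shared,
        "shared_pct": 100.0 * popcount(shared) / cbm_len,
        "covered_pct": 100.0 * popcount(cbm_a | cbm_b) / cbm_len,
    }


# cgroupA: left half (0xff00), cgroupB: left quarter (0xf000).
info = describe_overlap(0xFF00, 0xF000, 16)
print(hex(info["shared_mask"]), info["shared_pct"], info["covered_pct"])
# 0xf000 25.0 50.0
```

The two groups share exactly the left quarter while together covering only half of the cache, which a pair of percentages alone does not express.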

>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
>
> cpu — the cpu.shares parameter determines the share of CPU resources
> available to each process in all cgroups. Setting the parameter to 250,
> 250, and 500 in the finance, sales, and engineering cgroups respectively
> means that processes started in these groups will split the resources
> with a 1:1:2 ratio. Note that when a single process is running, it
> consumes as much CPU as necessary no matter which cgroup it is placed
> in. The CPU limitation only comes into effect when two or more processes
> compete for CPU resources.
>
>

These are defined more in terms of how many cache lines (or how many cache
ways) they can use, and it would be difficult to define them in terms of
percentage. In contrast, the cpu share is a time-shared thing and is much more
granular, whereas here it is not; it is occupancy in terms of cache lines/ways
(however, this is not really defined as a restriction, that is just the way it
is now).
Also note that the granularity of the bitmasks defines the granularity of the
percentages, and in some SKUs the granularity is 2b and not 1b. So technically
you won't be able to allocate a percentage of cache even in 10% granularity in
most cases (if there are 30MB and 25 ways, like in one of the hsw SKUs), and
this will vary across SKUs, which makes it more complicated for users. However,
a user library is free to define its own interface on top of the underlying
cgroup interface, for example if you never care about overlapping and are
using it for a specific SKU. The underlying cgroup framework is meant to be
generic for all SKUs and usable for most of the use cases.
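
The granularity point can be sketched as a rounding problem. The snippet below assumes a 25-bit mask that can only be set in steps of 2 bits; both numbers are taken loosely from the example above and the helper is hypothetical.

```python
def nearest_allocation(pct, cbm_len, step_bits=1, min_bits=1):
    """Round a requested percentage to a bit count the mask can express.

    Assumptions: equal cache slice per bit, allocations only in multiples of
    step_bits, and min_bits itself being a multiple of step_bits.
    """
    max_bits = cbm_len - (cbm_len % step_bits)   # largest usable multiple
    wanted_bits = cbm_len * pct / 100.0
    bits = round(wanted_bits / step_bits) * step_bits
    bits = max(bits, min_bits, step_bits)
    bits = min(bits, max_bits)
    return bits, 100.0 * bits / cbm_len


# Asking for 10% on a 25-bit mask with 2-bit granularity yields 8%.
print(nearest_allocation(10, 25, step_bits=2, min_bits=2))  # (2, 8.0)
```

A library offering percentages on top of the cgroup interface would have to report this kind of rounding back to the user.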

Also, at this point I see a lot of enterprise and other users already using
the cgroup interface or showing interest in it.
However, I see your point about the ease with which a user can specify a
size/percentage, which he might be used to doing for other resources, rather
than bits, where he needs to get an idea of the size by calculating it
separately. But again, note that you may not be able to define percentages in
many scenarios like the one above. Another question is that we would need to
convince the users to adapt to the modified percentage user model (e.g. the
one you describe above, where percentage - 100 is what is shared).
I can review these requirements and the others I have received and get back
on the closest that can be done, if possible.

Thanks,
Vikas

* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-27  1:29       ` Marcelo Tosatti
  2015-03-31  1:17         ` Marcelo Tosatti
  2015-03-31 17:27         ` Vikas Shivappa
@ 2015-03-31 17:32         ` Vikas Shivappa
  2 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-03-31 17:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, Matt Fleming, Auld, Will, Williamson, Glenn P,
	Juvva, Kanaka D



On Thu, 26 Mar 2015, Marcelo Tosatti wrote:

>
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread ?

It was the same version (V4) as above but a different subthread, I think. Here
you go anyway:

https://lkml.kernel.org/r/alpine.DEB.2.10.1502271155180.31647@vshiva-Udesk




* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-31 17:27         ` Vikas Shivappa
@ 2015-03-31 22:56           ` Marcelo Tosatti
  2015-04-01 18:20             ` Vikas Shivappa
  2015-07-28 23:37           ` Marcelo Tosatti
  1 sibling, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-03-31 22:56 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
> 
> 
> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
> 
> >
> >I can't find any discussion relating to exposing the CBM interface
> >directly to userspace in that thread ?
> >
> >Cpu.shares is written in ratio form, which is much more natural.
> >Do you see any advantage in maintaining the
> >
> >(ratio -> cbm bitmasks)
> >
> >translation in userspace rather than in the kernel ?
> >
> >What about something like:
> >
> >
> >		      root cgroup
> >		   /		  \
> >		  /		    \
> >		/		      \
> >	cgroupA-80			cgroupB-30
> >
> >
> >So that whatever exceeds 100% is the ratio of cache
> >shared at that level (cgroup A and B share 10% of cache
> >at that level).
> 
> But this also means the 2 groups share all of the cache ?
> 
> Specifying the amount of bits to be shared lets you specify the
> exact cache area where you want to share and also when your total
> occupancy does not cover all of the cache. For ex: it gets more
> complex when you want to share say only the left quarter of the
> cache. cgroupA gets left half and cgroup gets left quarter. The
> bitmask aligns with how the h/w is designed to share the cache which
> gives you flexibility to define any specific overlapping areas of
> the cache.

> >https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> >
> >cpu — the cpu.shares parameter determines the share of CPU resources
> >available to each process in all cgroups. Setting the parameter to 250,
> >250, and 500 in the finance, sales, and engineering cgroups respectively
> >means that processes started in these groups will split the resources
> >with a 1:1:2 ratio. Note that when a single process is running, it
> >consumes as much CPU as necessary no matter which cgroup it is placed
> >in. The CPU limitation only comes into effect when two or more processes
> >compete for CPU resources.
> >
> >
> 
> These are more defined in terms of how many cache lines (or how many
> cache ways) they can use and would be difficult to define them in
> terms of percentage. In contrast the cpu share is a time shared
> thing and is much more granular where as here its not , its
> occupancy in terms of cache lines/ways.. (however this is not really
> defined as a restriction but thats the way it is now).
> Also note that the granularity of the bitmasks define the
> granularity of the percentages and in some SKUs the granularity is
> 2b and not 1b.. So technically you wont be able to even allocate
> percentage of cache even in 10% granularity for most of the cases
> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
> vary for different SKUs which makes it more complicated for users.
> However the user library is free to define own interface based on
> the underlying cgroup interface say for example you never care about
> the overlapping and using it for a specific SKU etc.. The underlying
> cgroup framework is meant to be  generic for all SKus and used for
> most of the use cases.
> 
> Also at this point I see a lot of enterprise and and other users
> already using the cgroup interface or shown interest in the same.
> However I see your point where you indicate the ease with which user
> can specify in size/percentage which he might be used to doing for
> other resources rather than bits where he needs to get an idea size
> by calculating it seperately - But again note that you may not be
> able to define percentages in many scenarios like the one above. And
> another question would be we would need to convince the users to
> adapt to the modified percentage user model (ex: like the one you
> say above where percentage - 100 is the one thats shared)
> I can review this requirements and others I have received and get
> back to see the closest that can be done if possible.
> 
> Thanks,
> Vikas

Vikas,

I see. I don't have anything against performing the translation in userspace
(I agree userspace should be able to allow ratios and specific
minimum/maximum counts). Can you please export the relevant information
in files in /sys or in the cgroup itself, rather than requiring userspace
to parse CPUID etc.? That includes the EBX register from
CPUID(EAX=10H, ECX=1), which is necessary to implement "reserved LLC"
properly.

The current interface is unable to handle the cross CPU case, though.
It would be necessary to expose per-socket masks. 
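
For reference, a minimal userspace sketch of the information in question,
assuming the register layout of CPUID.(EAX=10H, ECX=1) as documented in the
SDM (EAX[4:0] = mask length minus one, EBX = bitmap of ways that may be
shared with other agents, EDX[15:0] = highest class of service). The struct
and function names are hypothetical and not part of the patch series:

```c
#include <assert.h>
#include <stdint.h>

/* Fields reported by CPUID.(EAX=10H, ECX=1) for L3 cache allocation. */
struct cat_l3_info {
	unsigned int cbm_len;		/* number of bits in a capacity bit mask */
	uint32_t shared_bitmap;		/* EBX: ways other agents may also use */
	unsigned int num_closids;	/* number of classes of service */
};

/*
 * Decode raw register values. The caller is assumed to have executed
 * CPUID with EAX=0x10, ECX=1 (for example via __get_cpuid_count() from
 * the compiler's <cpuid.h>).
 */
static struct cat_l3_info cat_l3_decode(uint32_t eax, uint32_t ebx,
					uint32_t edx)
{
	struct cat_l3_info info;

	info.cbm_len = (eax & 0x1f) + 1;	/* EAX[4:0] holds length - 1 */
	info.shared_bitmap = ebx;
	info.num_closids = (edx & 0xffff) + 1;	/* EDX[15:0] holds highest COS */
	return info;
}

/* Mask bits that are not shared with other agents ("reserved LLC"). */
static uint32_t cat_l3_exclusive_mask(const struct cat_l3_info *info)
{
	uint32_t full;

	assert(info->cbm_len >= 1 && info->cbm_len <= 32);
	full = info->cbm_len == 32 ? 0xffffffffu : ((1u << info->cbm_len) - 1);
	return full & ~info->shared_bitmap;
}
```

Exporting these three values through /sys or the cgroup would let userspace
compute the exclusive mask without executing CPUID itself.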



* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-31 22:56           ` Marcelo Tosatti
@ 2015-04-01 18:20             ` Vikas Shivappa
  0 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-04-01 18:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson,
	kanaka.d.juvva




On Tue, 31 Mar 2015, Marcelo Tosatti wrote:

> On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
>>
>>
>> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
>>
>>>
>>> I can't find any discussion relating to exposing the CBM interface
>>> directly to userspace in that thread ?
>>>
>>> Cpu.shares is written in ratio form, which is much more natural.
>>> Do you see any advantage in maintaining the
>>>
>>> (ratio -> cbm bitmasks)
>>>
>>> translation in userspace rather than in the kernel ?
>>>
>>> What about something like:
>>>
>>>
>>> 		      root cgroup
>>> 		   /		  \
>>> 		  /		    \
>>> 		/		      \
>>> 	cgroupA-80			cgroupB-30
>>>
>>>
>>> So that whatever exceeds 100% is the ratio of cache
>>> shared at that level (cgroup A and B share 10% of cache
>>> at that level).
>>
>> But this also means the 2 groups share all of the cache ?
>>
>> Specifying the amount of bits to be shared lets you specify the
>> exact cache area where you want to share and also when your total
>> occupancy does not cover all of the cache. For ex: it gets more
>> complex when you want to share say only the left quarter of the
>> cache. cgroupA gets left half and cgroup gets left quarter. The
>> bitmask aligns with how the h/w is designed to share the cache which
>> gives you flexibility to define any specific overlapping areas of
>> the cache.
>
>>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
>>>
>>> cpu — the cpu.shares parameter determines the share of CPU resources
>>> available to each process in all cgroups. Setting the parameter to 250,
>>> 250, and 500 in the finance, sales, and engineering cgroups respectively
>>> means that processes started in these groups will split the resources
>>> with a 1:1:2 ratio. Note that when a single process is running, it
>>> consumes as much CPU as necessary no matter which cgroup it is placed
>>> in. The CPU limitation only comes into effect when two or more processes
>>> compete for CPU resources.
>>>
>>>
>>
>> These are more defined in terms of how many cache lines (or how many
>> cache ways) they can use and would be difficult to define them in
>> terms of percentage. In contrast the cpu share is a time shared
>> thing and is much more granular where as here its not , its
>> occupancy in terms of cache lines/ways.. (however this is not really
>> defined as a restriction but thats the way it is now).
>> Also note that the granularity of the bitmasks define the
>> granularity of the percentages and in some SKUs the granularity is
>> 2b and not 1b.. So technically you wont be able to even allocate
>> percentage of cache even in 10% granularity for most of the cases
>> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
>> vary for different SKUs which makes it more complicated for users.
>> However the user library is free to define own interface based on
>> the underlying cgroup interface say for example you never care about
>> the overlapping and using it for a specific SKU etc.. The underlying
>> cgroup framework is meant to be  generic for all SKus and used for
>> most of the use cases.
>>
>> Also at this point I see a lot of enterprise and and other users
>> already using the cgroup interface or shown interest in the same.
>> However I see your point where you indicate the ease with which user
>> can specify in size/percentage which he might be used to doing for
>> other resources rather than bits where he needs to get an idea size
>> by calculating it seperately - But again note that you may not be
>> able to define percentages in many scenarios like the one above. And
>> another question would be we would need to convince the users to
>> adapt to the modified percentage user model (ex: like the one you
>> say above where percentage - 100 is the one thats shared)
>> I can review this requirements and others I have received and get
>> back to see the closest that can be done if possible.
>>
>> Thanks,
>> Vikas
>
> Vikas,
>
> I see. Don't have anything against performing the translation in userspace
> (i agree userspace should be able to allow ratios and specific
> minimum/maximum counts). Can you please export the relevant information
> in files in /sys or cgroups itself rather than requiring userspace to
> parse CPUID etc? Including the EBX register from CPUID(EAX=10H, ECX=1),
> which is necessary to implement "reserved LLC" properly.
>
> The current interface is unable to handle the cross CPU case, though.
> It would be necessary to expose per-socket masks.
>
>

Marcelo,

The current patch series supports per-socket updates to masks, although the
CLOSids are allocated globally, just like in CMT, and not per package.

The maximum bitmask is the root node's bitmask, which is already exposed. The
number of CLOSids is not exposed because the kernel optimizes their usage
internally, and exposing the number could give the user a wrong picture. For
example, if the number of CLOSids available is, say, 4, the kernel could
actually allocate them to more than 4 cgroups, and this logic may change based
on other features that may be added to the cgroup or on features available in
the SKUs. However, with CAT cgroups an error is returned once the kernel runs
out of CLOSids. I am still reviewing this requirement with respect to the
CLOSids and will send an update soon.
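
To illustrate why more than 4 cgroups can fit on 4 CLOSids, here is a small
userspace model of the CLOSid<->CBM reuse logic. It is a sketch only: the
table size, the function names closid_get()/closid_put() and the use of a
plain array are assumptions for illustration and do not reproduce the kernel
code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NUM_CLOSIDS 4	/* assumed value for illustration */

struct clos_cbm_map {
	uint32_t cbm;
	unsigned int cgrp_count;	/* cgroups currently using this CLOSid */
};

static struct clos_cbm_map ccmap[NUM_CLOSIDS];

/*
 * Return a CLOSid for the requested mask: share an in-use CLOSid that
 * already has this mask, otherwise take a free one, otherwise -ENOSPC.
 */
static int closid_get(uint32_t cbm)
{
	int i, free_id = -1;

	for (i = 0; i < NUM_CLOSIDS; i++) {
		if (ccmap[i].cgrp_count && ccmap[i].cbm == cbm) {
			ccmap[i].cgrp_count++;
			return i;
		}
		if (!ccmap[i].cgrp_count && free_id < 0)
			free_id = i;
	}
	if (free_id < 0)
		return -ENOSPC;

	ccmap[free_id].cbm = cbm;
	ccmap[free_id].cgrp_count = 1;
	return free_id;
}

/* Drop one reference; the CLOSid becomes free when the count hits zero. */
static void closid_put(int closid)
{
	assert(closid >= 0 && closid < NUM_CLOSIDS);
	if (ccmap[closid].cgrp_count)
		ccmap[closid].cgrp_count--;
}
```

In this model any number of cgroups can share a mask, and only a request for
a fifth distinct mask fails.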

Thanks,
Vikas


* Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT
  2015-03-12 23:16 ` [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT Vikas Shivappa
@ 2015-04-09 20:56   ` Marcelo Tosatti
  2015-04-13  2:36     ` Vikas Shivappa
  0 siblings, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-04-09 20:56 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: vikas.shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Thu, Mar 12, 2015 at 04:16:03PM -0700, Vikas Shivappa wrote:
> Add support for cache bit mask manipulation. The change adds a file to
> the RDT cgroup which represents the CBM(cache bit mask) for the cgroup.
> 
> The RDT cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
> cgroup never fails.  When a child cgroup is created it inherits the
> CLOSid and the CBM from its parent.  When a user changes the default
> CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> used before. If the new CBM is the one that is already used, the
> count for that CLOSid<->CBM is incremented. The changing of 'cbm'
> may fail with -ENOSPC once the kernel runs out of maximum CLOSids it
> can support.
> User can create as many cgroups as he wants but having different CBMs
> at the same time is restricted by the maximum number of CLOSids
> (multiple cgroups can have the same CBM).
> Kernel maintains a CLOSid<->cbm mapping which keeps count
> of cgroups using a CLOSid.
> 
> The tasks in the CAT cgroup would get to fill the LLC cache represented
> by the cgroup's 'cbm' file.
> 
> Reuse of CLOSids for cgroups with same bitmask also has following
> advantages:
> - This helps to use the scant CLOSids optimally.
> - This also implies that during context switch, write to PQR-MSR is done
> only when a task with a different bitmask is scheduled in.
> 
> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
> ---
>  arch/x86/include/asm/intel_rdt.h |   3 +
>  arch/x86/kernel/cpu/intel_rdt.c  | 205 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 208 insertions(+)
> 
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index 87af1a5..0ed28d9 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -4,6 +4,9 @@
>  #ifdef CONFIG_CGROUP_RDT
>  
>  #include <linux/cgroup.h>
> +#define MAX_CBM_LENGTH			32
> +#define IA32_L3_CBM_BASE		0xc90
> +#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
>  
>  struct rdt_subsys_info {
>  	/* Clos Bitmap to keep track of available CLOSids.*/
> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
> index 3726f41..495497a 100644
> --- a/arch/x86/kernel/cpu/intel_rdt.c
> +++ b/arch/x86/kernel/cpu/intel_rdt.c
> @@ -33,6 +33,9 @@ static struct rdt_subsys_info rdtss_info;
>  static DEFINE_MUTEX(rdt_group_mutex);
>  struct intel_rdt rdt_root_group;
>  
> +#define rdt_for_each_child(pos_css, parent_ir)		\
> +	css_for_each_child((pos_css), &(parent_ir)->css)
> +
>  static inline bool cat_supported(struct cpuinfo_x86 *c)
>  {
>  	if (cpu_has(c, X86_FEATURE_CAT_L3))
> @@ -83,6 +86,31 @@ static int __init rdt_late_init(void)
>  late_initcall(rdt_late_init);
>  
>  /*
> + * Allocates a new closid from unused closids.
> + * Called with the rdt_group_mutex held.
> + */
> +
> +static int rdt_alloc_closid(struct intel_rdt *ir)
> +{
> +	unsigned int id;
> +	unsigned int maxid;
> +
> +	lockdep_assert_held(&rdt_group_mutex);
> +
> +	maxid = boot_cpu_data.x86_cat_closs;
> +	id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
> +	if (id == maxid)
> +		return -ENOSPC;
> +
> +	set_bit(id, rdtss_info.closmap);
> +	WARN_ON(ccmap[id].cgrp_count);
> +	ccmap[id].cgrp_count++;
> +	ir->clos = id;
> +
> +	return 0;
> +}
> +
> +/*
>  * Called with the rdt_group_mutex held.
>  */
>  static int rdt_free_closid(struct intel_rdt *ir)
> @@ -133,8 +161,185 @@ static void rdt_css_free(struct cgroup_subsys_state *css)
>  	mutex_unlock(&rdt_group_mutex);
>  }
>  
> +/*
> + * Tests if atleast two contiguous bits are set.
> + */
> +
> +static inline bool cbm_is_contiguous(unsigned long var)
> +{
> +	unsigned long first_bit, zero_bit;
> +	unsigned long maxcbm = MAX_CBM_LENGTH;
> +
> +	if (bitmap_weight(&var, maxcbm) < 2)
> +		return false;
> +
> +	first_bit = find_next_bit(&var, maxcbm, 0);
> +	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
> +
> +	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int cat_cbm_read(struct seq_file *m, void *v)
> +{
> +	struct intel_rdt *ir = css_rdt(seq_css(m));
> +
> +	seq_printf(m, "%08lx\n", ccmap[ir->clos].cbm);
> +	return 0;
> +}
> +
> +static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
> +{
> +	struct intel_rdt *par, *c;
> +	struct cgroup_subsys_state *css;
> +	unsigned long *cbm_tmp;
> +
> +	if (!cbm_is_contiguous(cbmvalue)) {
> +		pr_info("cbm should have >= 2 bits and be contiguous\n");
> +		return -EINVAL;
> +	}
> +
> +	par = parent_rdt(ir);
> +	cbm_tmp = &ccmap[par->clos].cbm;
> +	if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
> +		return -EINVAL;

Can you have different errors for the different cases?

> +	rcu_read_lock();
> +	rdt_for_each_child(css, ir) {
> +		c = css_rdt(css);
> +		cbm_tmp = &ccmap[c->clos].cbm;
> +		if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
> +			pr_info("Children's mask not a subset\n");
> +			rcu_read_unlock();
> +			return -EINVAL;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	return 0;
> +}
> +
> +static bool cbm_search(unsigned long cbm, int *closid)
> +{
> +	int maxid = boot_cpu_data.x86_cat_closs;
> +	unsigned int i;
> +
> +	for (i = 0; i < maxid; i++)
> +		if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> +			*closid = i;
> +			return true;
> +		}
> +
> +	return false;
> +}
> +
> +static void cbmmap_dump(void)
> +{
> +	int i;
> +
> +	pr_debug("CBMMAP\n");
> +	for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
> +		pr_debug("cbm: 0x%x,cgrp_count: %u\n",
> +		 (unsigned int)ccmap[i].cbm, ccmap[i].cgrp_count);
> +}
> +
> +static void cpu_cbm_update(void *info)
> +{
> +	unsigned int closid = *((unsigned int *)info);
> +
> +	wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cbm);
> +}
> +
> +static inline void cbm_update(unsigned int closid)
> +{
> +	int pkg_id = -1;
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		if (pkg_id == topology_physical_package_id(cpu))
> +			continue;
> +		smp_call_function_single(cpu, cpu_cbm_update, &closid, 1);
> +		pkg_id = topology_physical_package_id(cpu);
> +
> +

You could call smp_call_function_many() once instead, which is more efficient.

Can this race with CPU hotplug? BTW, on CPU hotplug, where are
the IA32_L3_MASK_n MSRs initialized for the new CPU?
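
As a rough illustration of the single-call approach, the userspace sketch
below picks one CPU per package and collects them in a bitmask; in the kernel
the equivalent cpumask would be handed to one smp_call_function_many() call.
Note that the quoted loop only remembers the last package id, so with
interleaved CPU numbering it could target the same package more than once;
tracking the set of packages already seen avoids that. The pkg_of[] array is
a hypothetical stand-in for topology_physical_package_id(), and hotplug is
not modelled here.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PKGS 64	/* assumed upper bound for illustration */

/*
 * Build a bitmask containing the first CPU found in every package.
 * pkg_of[cpu] gives the package id of each CPU; at most 64 CPUs are
 * supported because the result is a single 64-bit mask.
 */
static uint64_t one_cpu_per_package(const int *pkg_of, int ncpus)
{
	uint64_t seen_pkgs = 0;	/* packages that already have a CPU */
	uint64_t cpu_mask = 0;
	int cpu;

	assert(ncpus >= 0 && ncpus <= 64);

	for (cpu = 0; cpu < ncpus; cpu++) {
		int pkg = pkg_of[cpu];

		if (pkg < 0 || pkg >= MAX_PKGS)
			continue;	/* ignore ids outside the model */
		if (seen_pkgs & (1ULL << pkg))
			continue;	/* package already covered */
		seen_pkgs |= 1ULL << pkg;
		cpu_mask |= 1ULL << cpu;
	}
	return cpu_mask;
}
```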



* Re: [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT
  2015-04-09 20:56   ` Marcelo Tosatti
@ 2015-04-13  2:36     ` Vikas Shivappa
  0 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-04-13  2:36 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, vikas.shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson,
	kanaka.d.juvva



On Thu, 9 Apr 2015, Marcelo Tosatti wrote:

> On Thu, Mar 12, 2015 at 04:16:03PM -0700, Vikas Shivappa wrote:
>> Add support for cache bit mask manipulation. The change adds a file to
>> the RDT cgroup which represents the CBM(cache bit mask) for the cgroup.
>>
>> The RDT cgroup follows cgroup hierarchy ,mkdir and adding tasks to the
>> cgroup never fails.  When a child cgroup is created it inherits the
>> CLOSid and the CBM from its parent.  When a user changes the default
>> CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
>> used before. If the new CBM is the one that is already used, the
>> count for that CLOSid<->CBM is incremented. The changing of 'cbm'
>> may fail with -ENOSPC once the kernel runs out of maximum CLOSids it
>> can support.
>> User can create as many cgroups as he wants but having different CBMs
>> at the same time is restricted by the maximum number of CLOSids
>> (multiple cgroups can have the same CBM).
>> Kernel maintains a CLOSid<->cbm mapping which keeps count
>> of cgroups using a CLOSid.
>>
>> The tasks in the CAT cgroup would get to fill the LLC cache represented
>> by the cgroup's 'cbm' file.
>>
>> Reuse of CLOSids for cgroups with same bitmask also has following
>> advantages:
>> - This helps to use the scant CLOSids optimally.
>> - This also implies that during context switch, write to PQR-MSR is done
>> only when a task with a different bitmask is scheduled in.
>>
>> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
>> ---
>>  arch/x86/include/asm/intel_rdt.h |   3 +
>>  arch/x86/kernel/cpu/intel_rdt.c  | 205 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 208 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
>> index 87af1a5..0ed28d9 100644
>> --- a/arch/x86/include/asm/intel_rdt.h
>> +++ b/arch/x86/include/asm/intel_rdt.h
>> @@ -4,6 +4,9 @@
>>  #ifdef CONFIG_CGROUP_RDT
>>
>>  #include <linux/cgroup.h>
>> +#define MAX_CBM_LENGTH			32
>> +#define IA32_L3_CBM_BASE		0xc90
>> +#define CBM_FROM_INDEX(x)		(IA32_L3_CBM_BASE + x)
>>
>>  struct rdt_subsys_info {
>>  	/* Clos Bitmap to keep track of available CLOSids.*/
>> diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c
>> index 3726f41..495497a 100644
>> --- a/arch/x86/kernel/cpu/intel_rdt.c
>> +++ b/arch/x86/kernel/cpu/intel_rdt.c
>> @@ -33,6 +33,9 @@ static struct rdt_subsys_info rdtss_info;
>>  static DEFINE_MUTEX(rdt_group_mutex);
>>  struct intel_rdt rdt_root_group;
>>
>> +#define rdt_for_each_child(pos_css, parent_ir)		\
>> +	css_for_each_child((pos_css), &(parent_ir)->css)
>> +
>>  static inline bool cat_supported(struct cpuinfo_x86 *c)
>>  {
>>  	if (cpu_has(c, X86_FEATURE_CAT_L3))
>> @@ -83,6 +86,31 @@ static int __init rdt_late_init(void)
>>  late_initcall(rdt_late_init);
>>
>>  /*
>> + * Allocates a new closid from unused closids.
>> + * Called with the rdt_group_mutex held.
>> + */
>> +
>> +static int rdt_alloc_closid(struct intel_rdt *ir)
>> +{
>> +	unsigned int id;
>> +	unsigned int maxid;
>> +
>> +	lockdep_assert_held(&rdt_group_mutex);
>> +
>> +	maxid = boot_cpu_data.x86_cat_closs;
>> +	id = find_next_zero_bit(rdtss_info.closmap, maxid, 0);
>> +	if (id == maxid)
>> +		return -ENOSPC;
>> +
>> +	set_bit(id, rdtss_info.closmap);
>> +	WARN_ON(ccmap[id].cgrp_count);
>> +	ccmap[id].cgrp_count++;
>> +	ir->clos = id;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>>  * Called with the rdt_group_mutex held.
>>  */
>>  static int rdt_free_closid(struct intel_rdt *ir)
>> @@ -133,8 +161,185 @@ static void rdt_css_free(struct cgroup_subsys_state *css)
>>  	mutex_unlock(&rdt_group_mutex);
>>  }
>>
>> +/*
>> + * Tests if atleast two contiguous bits are set.
>> + */
>> +
>> +static inline bool cbm_is_contiguous(unsigned long var)
>> +{
>> +	unsigned long first_bit, zero_bit;
>> +	unsigned long maxcbm = MAX_CBM_LENGTH;
>> +
>> +	if (bitmap_weight(&var, maxcbm) < 2)
>> +		return false;
>> +
>> +	first_bit = find_next_bit(&var, maxcbm, 0);
>> +	zero_bit = find_next_zero_bit(&var, maxcbm, first_bit);
>> +
>> +	if (find_next_bit(&var, maxcbm, zero_bit) < maxcbm)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int cat_cbm_read(struct seq_file *m, void *v)
>> +{
>> +	struct intel_rdt *ir = css_rdt(seq_css(m));
>> +
>> +	seq_printf(m, "%08lx\n", ccmap[ir->clos].cbm);
>> +	return 0;
>> +}
>> +
>> +static int validate_cbm(struct intel_rdt *ir, unsigned long cbmvalue)
>> +{
>> +	struct intel_rdt *par, *c;
>> +	struct cgroup_subsys_state *css;
>> +	unsigned long *cbm_tmp;
>> +
>> +	if (!cbm_is_contiguous(cbmvalue)) {
>> +		pr_info("cbm should have >= 2 bits and be contiguous\n");
>> +		return -EINVAL;
>> +	}
>> +
>> +	par = parent_rdt(ir);
>> +	cbm_tmp = &ccmap[par->clos].cbm;
>> +	if (!bitmap_subset(&cbmvalue, cbm_tmp, MAX_CBM_LENGTH))
>> +		return -EINVAL;
>
> Can you have different errors for the different cases?

Could use -EPERM.
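
A userspace sketch of what the split could look like, assuming -EPERM is
returned when the new mask is not a subset of the parent's mask while
-EINVAL stays for a malformed mask and for the children check. The function
signature and the flat child array are hypothetical simplifications of the
cgroup walk in the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the set bits form one contiguous run of at least two bits. */
static bool cbm_is_contiguous(uint32_t cbm)
{
	uint32_t low, sum;

	if (cbm == 0)
		return false;
	low = cbm & (~cbm + 1);	/* lowest set bit */
	if (cbm == low)
		return false;	/* only a single bit is set */
	sum = cbm + low;	/* carry ripples through the lowest run */
	return (cbm & sum) == 0;	/* leftover bits mean a second run */
}

static int validate_cbm(uint32_t cbm, uint32_t parent_cbm,
			const uint32_t *child_cbms, int nchildren)
{
	int i;

	assert(child_cbms || nchildren == 0);

	if (!cbm_is_contiguous(cbm))
		return -EINVAL;		/* malformed mask */
	if (cbm & ~parent_cbm)
		return -EPERM;		/* assumed: exceeds the parent's mask */
	for (i = 0; i < nchildren; i++)
		if (child_cbms[i] & ~cbm)
			return -EINVAL;	/* a child's mask would not be a subset */
	return 0;
}
```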

>
>> +	rcu_read_lock();
>> +	rdt_for_each_child(css, ir) {
>> +		c = css_rdt(css);
>> +		cbm_tmp = &ccmap[c->clos].cbm;
>> +		if (!bitmap_subset(cbm_tmp, &cbmvalue, MAX_CBM_LENGTH)) {
>> +			pr_info("Children's mask not a subset\n");
>> +			rcu_read_unlock();
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	rcu_read_unlock();
>> +	return 0;
>> +}
>> +
>> +static bool cbm_search(unsigned long cbm, int *closid)
>> +{
>> +	int maxid = boot_cpu_data.x86_cat_closs;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < maxid; i++)
>> +		if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
>> +			*closid = i;
>> +			return true;
>> +		}
>> +
>> +	return false;
>> +}
>> +
>> +static void cbmmap_dump(void)
>> +{
>> +	int i;
>> +
>> +	pr_debug("CBMMAP\n");
>> +	for (i = 0; i < boot_cpu_data.x86_cat_closs; i++)
>> +		pr_debug("cbm: 0x%x,cgrp_count: %u\n",
>> +		 (unsigned int)ccmap[i].cbm, ccmap[i].cgrp_count);
>> +}
>> +
>> +static void cpu_cbm_update(void *info)
>> +{
>> +	unsigned int closid = *((unsigned int *)info);
>> +
>> +	wrmsrl(CBM_FROM_INDEX(closid), ccmap[closid].cbm);
>> +}
>> +
>> +static inline void cbm_update(unsigned int closid)
>> +{
>> +	int pkg_id = -1;
>> +	int cpu;
>> +
>> +	for_each_online_cpu(cpu) {
>> +		if (pkg_id == topology_physical_package_id(cpu))
>> +			continue;
>> +		smp_call_function_single(cpu, cpu_cbm_update, &closid, 1);
>> +		pkg_id = topology_physical_package_id(cpu);
>> +
>> +
>
> Can use smp_call_function_many, once, more efficient.
>
> Can this race with CPU hotplug? BTW, on CPU hotplug, where are
> the IA32_L3_MASK_n initialized for the new CPU ?
>

Thanks for pointing this out, will fix it. I think I slipped up when I changed
the design to not use the cpuset: I did not change the hot CPU update. I also
remembered another similar update that is needed: the S3 resume path needs a
fix to the software cache, as we used the MSR before.

Thanks,
Vikas

>


* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-03-31 17:27         ` Vikas Shivappa
  2015-03-31 22:56           ` Marcelo Tosatti
@ 2015-07-28 23:37           ` Marcelo Tosatti
  2015-07-29 21:20             ` Vikas Shivappa
  1 sibling, 1 reply; 20+ messages in thread
From: Marcelo Tosatti @ 2015-07-28 23:37 UTC (permalink / raw)
  To: Vikas Shivappa
  Cc: Vikas Shivappa, x86, linux-kernel, hpa, tglx, mingo, tj, peterz,
	matt.fleming, will.auld, glenn.p.williamson, kanaka.d.juvva

On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
> 
> 
> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
> 
> >
> >I can't find any discussion relating to exposing the CBM interface
> >directly to userspace in that thread ?
> >
> >Cpu.shares is written in ratio form, which is much more natural.
> >Do you see any advantage in maintaining the
> >
> >(ratio -> cbm bitmasks)
> >
> >translation in userspace rather than in the kernel ?
> >
> >What about something like:
> >
> >
> >		      root cgroup
> >		   /		  \
> >		  /		    \
> >		/		      \
> >	cgroupA-80			cgroupB-30
> >
> >
> >So that whatever exceeds 100% is the ratio of cache
> >shared at that level (cgroup A and B share 10% of cache
> >at that level).
> 
> But this also means the 2 groups share all of the cache ?
> 
> Specifying the amount of bits to be shared lets you specify the
> exact cache area where you want to share and also when your total
> occupancy does not cover all of the cache. For ex: it gets more
> complex when you want to share say only the left quarter of the
> cache. cgroupA gets left half and cgroup gets left quarter. The
> bitmask aligns with how the h/w is designed to share the cache which
> gives you flexibility to define any specific overlapping areas of
> the cache.
> 
> >
> >https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> >
> >cpu — the cpu.shares parameter determines the share of CPU resources
> >available to each process in all cgroups. Setting the parameter to 250,
> >250, and 500 in the finance, sales, and engineering cgroups respectively
> >means that processes started in these groups will split the resources
> >with a 1:1:2 ratio. Note that when a single process is running, it
> >consumes as much CPU as necessary no matter which cgroup it is placed
> >in. The CPU limitation only comes into effect when two or more processes
> >compete for CPU resources.
> >
> >
> 
> These are more defined in terms of how many cache lines (or how many
> cache ways) they can use and would be difficult to define them in
> terms of percentage. In contrast the cpu share is a time shared
> thing and is much more granular where as here its not , its
> occupancy in terms of cache lines/ways.. (however this is not really
> defined as a restriction but thats the way it is now).
> Also note that the granularity of the bitmasks define the
> granularity of the percentages and in some SKUs the granularity is
> 2b and not 1b.. So technically you wont be able to even allocate
> percentage of cache even in 10% granularity for most of the cases
> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
> vary for different SKUs which makes it more complicated for users.
> However the user library is free to define own interface based on
> the underlying cgroup interface say for example you never care about
> the overlapping and using it for a specific SKU etc.. The underlying
> cgroup framework is meant to be  generic for all SKus and used for
> most of the use cases.
> 
> Also at this point I see a lot of enterprise and and other users
> already using the cgroup interface or shown interest in the same.
> However I see your point where you indicate the ease with which user
> can specify in size/percentage which he might be used to doing for
> other resources rather than bits where he needs to get an idea size
> by calculating it seperately - But again note that you may not be
> able to define percentages in many scenarios like the one above. And
> another question would be we would need to convince the users to
> adapt to the modified percentage user model (ex: like the one you
> say above where percentage - 100 is the one thats shared)
> I can review this requirements and others I have received and get
> back to see the closest that can be done if possible.
> 
> Thanks,
> Vikas

Vikas,

Three questions:

First, the usage model. The usage model for CAT is the following
(please correct me if I'm wrong):

1) measure application performance without L3 cache reservation.
2) measure application perf with L3 cache reservation, increasing the
number of ways X until the desired perf is attained.

On migration to a new hardware platform, the way to achieve a benefit
similar to the one achieved when going from 1) to 2) is to reserve
_at least_ the number of bytes that "X ways" provided when the
measurement was performed. Is that correct?

If that is correct, then the user does want to record the "number of
bytes" that X ways provided on the measurement CPU.
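
The arithmetic behind "reserve at least N bytes" on a new platform can be
sketched as a ceiling division. This assumes each mask bit covers an equal
share of the cache, which may not hold on every implementation; the function
name and the convention of returning 0 when the request does not fit are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Smallest number of mask bits whose combined share of the cache is at
 * least want_bytes, or 0 if even the full mask is too small.
 */
static unsigned int bytes_to_cbm_bits(uint64_t want_bytes,
				      unsigned int cbm_len,
				      uint64_t cache_bytes)
{
	uint64_t bits;

	assert(cbm_len >= 1 && cbm_len <= 32 && cache_bytes > 0);

	/* ceil(want_bytes * cbm_len / cache_bytes) */
	bits = (want_bytes * cbm_len + cache_bytes - 1) / cache_bytes;
	return bits > cbm_len ? 0 : (unsigned int)bits;
}
```

For instance, with a 30MB cache and a 20-bit mask each bit covers 1.5MB, so
a 4MB reservation recorded on the old platform needs 3 bits on the new one.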

Second question:
Do you envision any use case in which the placement of cache,
and not the quantity of cache, is a criterion for the decision?
That is, two cases with the same amount of cache for each CLOSid,
but with different locations inside the cache
(except sharing of ways by two CLOSids, of course)?

Third question:
How about support for the (new) I/D cache division?









* Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide
  2015-07-28 23:37           ` Marcelo Tosatti
@ 2015-07-29 21:20             ` Vikas Shivappa
  0 siblings, 0 replies; 20+ messages in thread
From: Vikas Shivappa @ 2015-07-29 21:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vikas Shivappa, Vikas Shivappa, x86, linux-kernel, hpa, tglx,
	mingo, tj, peterz, matt.fleming, will.auld, glenn.p.williamson,
	kanaka.d.juvva




On Tue, 28 Jul 2015, Marcelo Tosatti wrote:

> On Tue, Mar 31, 2015 at 10:27:32AM -0700, Vikas Shivappa wrote:
>>
>>
>> On Thu, 26 Mar 2015, Marcelo Tosatti wrote:
>>
>>>
>>> I can't find any discussion relating to exposing the CBM interface
>>> directly to userspace in that thread ?
>>>
>>> Cpu.shares is written in ratio form, which is much more natural.
>>> Do you see any advantage in maintaining the
>>>
>>> (ratio -> cbm bitmasks)
>>>
>>> translation in userspace rather than in the kernel ?
>>>
>>> What about something like:
>>>
>>>
>>> 		      root cgroup
>>> 		   /		  \
>>> 		  /		    \
>>> 		/		      \
>>> 	cgroupA-80			cgroupB-30
>>>
>>>
>>> So that whatever exceeds 100% is the ratio of cache
>>> shared at that level (cgroup A and B share 10% of cache
>>> at that level).
>>
>> But this also means the 2 groups share all of the cache ?
>>
>> Specifying the amount of bits to be shared lets you specify the
>> exact cache area where you want to share and also when your total
>> occupancy does not cover all of the cache. For ex: it gets more
>> complex when you want to share say only the left quarter of the
>> cache. cgroupA gets left half and cgroup gets left quarter. The
>> bitmask aligns with how the h/w is designed to share the cache which
>> gives you flexibility to define any specific overlapping areas of
>> the cache.
>>
>>>
>>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
>>>
>>> cpu — the cpu.shares parameter determines the share of CPU resources
>>> available to each process in all cgroups. Setting the parameter to 250,
>>> 250, and 500 in the finance, sales, and engineering cgroups respectively
>>> means that processes started in these groups will split the resources
>>> with a 1:1:2 ratio. Note that when a single process is running, it
>>> consumes as much CPU as necessary no matter which cgroup it is placed
>>> in. The CPU limitation only comes into effect when two or more processes
>>> compete for CPU resources.
>>>
>>>
>>
>> These are more defined in terms of how many cache lines (or how many
>> cache ways) they can use and would be difficult to define them in
>> terms of percentage. In contrast the cpu share is a time shared
>> thing and is much more granular where as here its not , its
>> occupancy in terms of cache lines/ways.. (however this is not really
>> defined as a restriction but thats the way it is now).
>> Also note that the granularity of the bitmasks define the
>> granularity of the percentages and in some SKUs the granularity is
>> 2b and not 1b.. So technically you wont be able to even allocate
>> percentage of cache even in 10% granularity for most of the cases
>> (if there are 30MB and 25 ways like in one of hsw SKU) and this will
>> vary for different SKUs which makes it more complicated for users.
>> However the user library is free to define own interface based on
>> the underlying cgroup interface say for example you never care about
>> the overlapping and using it for a specific SKU etc.. The underlying
>> cgroup framework is meant to be  generic for all SKus and used for
>> most of the use cases.
>>
>> Also at this point I see a lot of enterprise and and other users
>> already using the cgroup interface or shown interest in the same.
>> However I see your point where you indicate the ease with which user
>> can specify in size/percentage which he might be used to doing for
>> other resources rather than bits where he needs to get an idea size
>> by calculating it seperately - But again note that you may not be
>> able to define percentages in many scenarios like the one above. And
>> another question would be we would need to convince the users to
>> adapt to the modified percentage user model (ex: like the one you
>> say above where percentage - 100 is the one thats shared)
>> I can review this requirements and others I have received and get
>> back to see the closest that can be done if possible.
>>
>> Thanks,
>> Vikas
>
> Vikas,
>
> Three questions:
>
> First, usage model. The usage model for CAT is the following
> (please correct me if i'm wrong):
>
> 1) measure application performance without L3 cache reservation.
> 2) measure application perf with L3 cache reservation and
> X number of ways until desired perf is attained.
>
> On migration to a new hardware platform, to achieve similar benefit
> achieved when going from 1) to 2) is to reserve _at least_ the number of
> bytes that "X ways" provided when the measurement was performed. Is that
> correct?
>
> If that is correct, then the user does want to record "number of bytes"
> that X ways on measurement CPU provided.
>

The mapping of ways to bits is implementation dependent, so we really cannot
treat one bit as one way.

To map a size to bits, you could check the cache capacity in /proc and then the
number of bits in the cbm (the max bits are shown in the root intel_rdt cgroup).
For example, if the cache is 2MB and the cbm has 16 bits, a mask of 0xff would
represent 1MB.
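
The same calculation as a small userspace sketch, assuming every bit of the
mask stands for an equal slice of the cache. The helper names are
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits in a capacity bit mask. */
static unsigned int cbm_weight(uint32_t cbm)
{
	unsigned int n = 0;

	for (; cbm; cbm &= cbm - 1)	/* clear the lowest set bit each pass */
		n++;
	return n;
}

/*
 * Approximate number of cache bytes covered by a mask, given the mask
 * length in bits and the total cache size in bytes.
 */
static uint64_t cbm_to_bytes(uint32_t cbm, unsigned int cbm_len,
			     uint64_t cache_bytes)
{
	assert(cbm_len >= 1 && cbm_len <= 32);
	return cache_bytes * cbm_weight(cbm) / cbm_len;
}
```

With cache_bytes = 2MB and cbm_len = 16, cbm_to_bytes(0xff, ...) gives 1MB,
matching the example.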

> Second question:
> Do you envision any use case which the placement of cache
> and not the quantity of cache is a criteria for decision?
> That is, two cases with the same amount of cache for each CLOSid,
> but with different locations inside the cache?
> (except sharing of ways by two CLOSid's, of course).
>

Say the max cbm is 16 bits: 000f allocates the right quarter and f000 allocates
the left quarter? The case extends to any number of valid contiguous bits.
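
A sketch of how such placement-specific masks could be built in userspace,
where "right" means the low-order bits and "left" the high-order bits of the
mask as written in hex. The function make_cbm() is a hypothetical helper,
not part of the interface:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build a contiguous mask of nbits bits starting at bit first_bit,
 * within a capacity bit mask that is cbm_len bits long.
 */
static uint32_t make_cbm(unsigned int first_bit, unsigned int nbits,
			 unsigned int cbm_len)
{
	uint32_t run;

	assert(cbm_len <= 32 && nbits >= 1 && first_bit + nbits <= cbm_len);

	run = nbits == 32 ? 0xffffffffu : ((1u << nbits) - 1);
	return run << first_bit;
}
```

Here make_cbm(0, 4, 16) yields 0x000f and make_cbm(12, 4, 16) yields 0xf000:
the same quantity of cache at two different locations.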


> Third question:
> How about support for the (new) I/D cache division?
>

I am planning to send a patch at the end of this week or early next week.

Thanks,
Vikas




end of thread, other threads:[~2015-07-29 21:20 UTC | newest]

Thread overview: 20+ messages
2015-03-12 23:16 [PATCH V5 0/7] x86/intel_rdt: Intel Cache Allocation Technology Vikas Shivappa
2015-03-12 23:16 ` [PATCH 1/7] x86/intel_rdt: Intel Cache Allocation Technology detection Vikas Shivappa
2015-03-12 23:16 ` [PATCH 2/7] x86/intel_rdt: Adds support for Class of service management Vikas Shivappa
2015-03-12 23:16 ` [PATCH 3/7] x86/intel_rdt: Support cache bit mask for Intel CAT Vikas Shivappa
2015-04-09 20:56   ` Marcelo Tosatti
2015-04-13  2:36     ` Vikas Shivappa
2015-03-12 23:16 ` [PATCH 4/7] x86/intel_rdt: Implement scheduling support for Intel RDT Vikas Shivappa
2015-03-12 23:16 ` [PATCH 5/7] x86/intel_rdt: Software Cache for IA32_PQR_MSR Vikas Shivappa
2015-03-12 23:16 ` [PATCH 6/7] x86/intel_rdt: Intel haswell CAT enumeration Vikas Shivappa
2015-03-12 23:16 ` [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide Vikas Shivappa
2015-03-25 22:39   ` Marcelo Tosatti
2015-03-26 18:38     ` Vikas Shivappa
2015-03-27  1:29       ` Marcelo Tosatti
2015-03-31  1:17         ` Marcelo Tosatti
2015-03-31 17:27         ` Vikas Shivappa
2015-03-31 22:56           ` Marcelo Tosatti
2015-04-01 18:20             ` Vikas Shivappa
2015-07-28 23:37           ` Marcelo Tosatti
2015-07-29 21:20             ` Vikas Shivappa
2015-03-31 17:32         ` Vikas Shivappa
