Netdev Archive on lore.kernel.org
help / color / mirror / Atom feed
* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
       [not found] <BYAPR11MB3527AEB1E4969C0833D1697ED9129@BYAPR11MB3527.namprd11.prod.outlook.com>
@ 2021-07-15 17:13 ` Bjorn Helgaas
  2021-07-15 18:12   ` Billie Alsup (balsup)
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 17:13 UTC (permalink / raw)
  To: Billie Alsup (balsup)
  Cc: Paul Menzel, Bjorn Helgaas, Guohan Lu,
	Madhava Reddy Siddareddygari (msiddare),
	linux-pci, linux-kernel, David S. Miller, Jakub Kicinski, netdev,
	Sergey Miroshnichenko

On Thu, Jul 15, 2021 at 04:52:26PM +0000, Billie Alsup (balsup) wrote:
> We are aware of how Cisco device specific this code is, and hadn't
> intended to upstream it.  This code was originally written for an
> older kernel version (4.8.28-WR9.0.0.26_cgl).  I am not the original
> author; I just ported it into various SONiC linux kernels.  We use
> ACPI with SONiC (although not on our non-SONiC products), so I
> thought I might be able to define such windows within the ACPI tree
> and have some generic code to read such configuration information
> from the ACPI tables,. However, initial attempts failed so I went
> with the existing approach.  I believe we did look at the hpmmiosize
> parameter, but iirc it applied to each bridge, rather than being a
> pool of address space to dynamically parcel out as necessary.

Right.  I mentioned "pci=resource_alignment=" because it claims to be
able to specify window sizes for specific bridges.  But I haven't
exercised that myself.

> There are multiple bridges involved in the hardware (there are 8
> hot-plug fabric cards, each with multiple PCI devices).  Devices on
> the card are in multiple power zones, so all devices are not
> immediately visible to the pci scanning code.  The top level bridge
> reserves close to 5G.  The 2nd level (towards the fabric cards)
> reserve 4.5G.  The 3rd level has 9 bridges each reserving 512M.  The
> 4th level reserves 384M (with a 512M alignment restriction iirc).
> The 5th level reserves 384M (again with an alignment restriction).
> That defines the bridge hierarchy visible at boot.  Things behind
> that 5th level are hot-plugged where there are two more bridge
> levels and 5 devices (1 requiring 2x64M blocks and 4 requiring
> 1x64M).
> 
> I'm not sure if the Cisco kernel team has revisited the hpmmiosize
> and resource_alignment parameters since this initial implementation.
> Reading the description of Sergey's patches, he seems to be
> approaching the same problem from a different direction.   It is
> unclear if such an approach is practical for our environment.   It
> would require updates to all of our SONiC drivers to support
> stopping/remapping/restarting, and it is unclear if that is
> acceptable.  It is certainly less preferable to pre-reserving the
> required space.  For our embedded product, we know exactly what
> devices will be plugged in, and allowing that to be pre-programmed
> into the PLX eeprom gives us the flexibility we need.  

If you know up front what devices are possible and how much space they
need, possibly your firmware could assign the bridge windows you need.
Linux generally does not change window assignments unless they are
broken.

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
  2021-07-15 17:13 ` [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges Bjorn Helgaas
@ 2021-07-15 18:12   ` Billie Alsup (balsup)
  2021-07-15 18:56     ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Billie Alsup (balsup) @ 2021-07-15 18:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Paul Menzel, Bjorn Helgaas, Guohan Lu,
	Madhava Reddy Siddareddygari (msiddare),
	linux-pci, linux-kernel, David S. Miller, Jakub Kicinski, netdev,
	Sergey Miroshnichenko

It took me a while to figure out that the "New Outlook" option doesn't actually allow sending plain text, so I have to switch to "Old Outlook" mode.

It is not clear as to what parameters Linux would use to consider a window broken.  But if the kernel preserves some bridge window assignment, then it seems feasible for our BIOS to run this same algorithm (reading PLX persistent scratch registers to determine window sizes).  I will raise this possibility with our own kernel team to discuss with the bios team.  We can also look more closely at the resource_alignment options to see if that might suffice. Thanks for the information!


From: Bjorn Helgaas <helgaas@kernel.org>
Date: Thursday, July 15, 2021 at 10:14 AM
To: "Billie Alsup (balsup)" <balsup@cisco.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>, Bjorn Helgaas <bhelgaas@google.com>, Guohan Lu <lguohan@gmail.com>, "Madhava Reddy Siddareddygari (msiddare)" <msiddare@cisco.com>, "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
Subject: Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges

On Thu, Jul 15, 2021 at 04:52:26PM +0000, Billie Alsup (balsup) wrote:
We are aware of how Cisco device specific this code is, and hadn't
intended to upstream it.  This code was originally written for an
older kernel version (4.8.28-WR9.0.0.26_cgl).  I am not the original
author; I just ported it into various SONiC linux kernels.  We use
ACPI with SONiC (although not on our non-SONiC products), so I
thought I might be able to define such windows within the ACPI tree
and have some generic code to read such configuration information
from the ACPI tables,. However, initial attempts failed so I went
with the existing approach.  I believe we did look at the hpmmiosize
parameter, but iirc it applied to each bridge, rather than being a
pool of address space to dynamically parcel out as necessary.

Right.  I mentioned "pci=resource_alignment=" because it claims to be
able to specify window sizes for specific bridges.  But I haven't
exercised that myself.

There are multiple bridges involved in the hardware (there are 8
hot-plug fabric cards, each with multiple PCI devices).  Devices on
the card are in multiple power zones, so all devices are not
immediately visible to the pci scanning code.  The top level bridge
reserves close to 5G.  The 2nd level (towards the fabric cards)
reserve 4.5G.  The 3rd level has 9 bridges each reserving 512M.  The
4th level reserves 384M (with a 512M alignment restriction iirc).
The 5th level reserves 384M (again with an alignment restriction).
That defines the bridge hierarchy visible at boot.  Things behind
that 5th level are hot-plugged where there are two more bridge
levels and 5 devices (1 requiring 2x64M blocks and 4 requiring
1x64M).

I'm not sure if the Cisco kernel team has revisited the hpmmiosize
and resource_alignment parameters since this initial implementation.
Reading the description of Sergey's patches, he seems to be
approaching the same problem from a different direction.   It is
unclear if such an approach is practical for our environment.   It
would require updates to all of our SONiC drivers to support
stopping/remapping/restarting, and it is unclear if that is
acceptable.  It is certainly less preferable to pre-reserving the
required space.  For our embedded product, we know exactly what
devices will be plugged in, and allowing that to be pre-programmed
into the PLX eeprom gives us the flexibility we need.  

If you know up front what devices are possible and how much space they
need, possibly your firmware could assign the bridge windows you need.
Linux generally does not change window assignments unless they are
broken.

Bjorn



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
  2021-07-15 18:12   ` Billie Alsup (balsup)
@ 2021-07-15 18:56     ` Bjorn Helgaas
  2021-07-15 19:49       ` Billie Alsup (balsup)
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 18:56 UTC (permalink / raw)
  To: Billie Alsup (balsup)
  Cc: Paul Menzel, Bjorn Helgaas, Guohan Lu,
	Madhava Reddy Siddareddygari (msiddare),
	linux-pci, linux-kernel, David S. Miller, Jakub Kicinski, netdev,
	Sergey Miroshnichenko

On Thu, Jul 15, 2021 at 06:12:25PM +0000, Billie Alsup (balsup) wrote:
> It took me a while to figure out that the "New Outlook" option
> doesn't actually allow sending plain text, so I have to switch to
> "Old Outlook" mode.

Since you've gone to that much trouble already, also note
http://vger.kernel.org/majordomo-info.html and
https://people.kernel.org/tglx/notes-about-netiquette 

BTW, the attribution in the email you quoted below got corrupted in
such a way that it appears that I wrote the whole thing, instead of 
what actually happened, which is that I wrote a half dozen lines of
response to your email.  Linux uses old skool email conventions ;)

> It is not clear as to what parameters Linux would use to consider a
> window broken.

By "broken," I just mean things that are prohibited by the PCI spec,
like "region doesn't fit in a window of an upstream device" or
"non-prefetchable region placed in a prefetchable window."

> But if the kernel preserves some bridge window
> assignment, then it seems feasible for our BIOS to run this same
> algorithm (reading PLX persistent scratch registers to determine
> window sizes).  I will raise this possibility with our own kernel
> team to discuss with the bios team.  We can also look more closely
> at the resource_alignment options to see if that might suffice.
> Thanks for the information!

> From: Bjorn Helgaas <helgaas@kernel.org>
> Date: Thursday, July 15, 2021 at 10:14 AM
> To: "Billie Alsup (balsup)" <balsup@cisco.com>
> Cc: Paul Menzel <pmenzel@molgen.mpg.de>, Bjorn Helgaas <bhelgaas@google.com>, Guohan Lu <lguohan@gmail.com>, "Madhava Reddy Siddareddygari (msiddare)" <msiddare@cisco.com>, "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> Subject: Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
> 
> On Thu, Jul 15, 2021 at 04:52:26PM +0000, Billie Alsup (balsup) wrote:
> We are aware of how Cisco device specific this code is, and hadn't
> intended to upstream it.  This code was originally written for an
> older kernel version (4.8.28-WR9.0.0.26_cgl).  I am not the original
> author; I just ported it into various SONiC linux kernels.  We use
> ACPI with SONiC (although not on our non-SONiC products), so I
> thought I might be able to define such windows within the ACPI tree
> and have some generic code to read such configuration information
> from the ACPI tables,. However, initial attempts failed so I went
> with the existing approach.  I believe we did look at the hpmmiosize
> parameter, but iirc it applied to each bridge, rather than being a
> pool of address space to dynamically parcel out as necessary.
> 
> Right.  I mentioned "pci=resource_alignment=" because it claims to be
> able to specify window sizes for specific bridges.  But I haven't
> exercised that myself.
> 
> There are multiple bridges involved in the hardware (there are 8
> hot-plug fabric cards, each with multiple PCI devices).  Devices on
> the card are in multiple power zones, so all devices are not
> immediately visible to the pci scanning code.  The top level bridge
> reserves close to 5G.  The 2nd level (towards the fabric cards)
> reserve 4.5G.  The 3rd level has 9 bridges each reserving 512M.  The
> 4th level reserves 384M (with a 512M alignment restriction iirc).
> The 5th level reserves 384M (again with an alignment restriction).
> That defines the bridge hierarchy visible at boot.  Things behind
> that 5th level are hot-plugged where there are two more bridge
> levels and 5 devices (1 requiring 2x64M blocks and 4 requiring
> 1x64M).
> 
> I'm not sure if the Cisco kernel team has revisited the hpmmiosize
> and resource_alignment parameters since this initial implementation.
> Reading the description of Sergey's patches, he seems to be
> approaching the same problem from a different direction.   It is
> unclear if such an approach is practical for our environment.   It
> would require updates to all of our SONiC drivers to support
> stopping/remapping/restarting, and it is unclear if that is
> acceptable.  It is certainly less preferable to pre-reserving the
> required space.  For our embedded product, we know exactly what
> devices will be plugged in, and allowing that to be pre-programmed
> into the PLX eeprom gives us the flexibility we need.  
> 
> If you know up front what devices are possible and how much space they
> need, possibly your firmware could assign the bridge windows you need.
> Linux generally does not change window assignments unless they are
> broken.
> 
> Bjorn
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
  2021-07-15 18:56     ` Bjorn Helgaas
@ 2021-07-15 19:49       ` Billie Alsup (balsup)
  2021-07-15 19:56         ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Billie Alsup (balsup) @ 2021-07-15 19:49 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Paul Menzel, Bjorn Helgaas, Guohan Lu,
	Madhava Reddy Siddareddygari (msiddare),
	linux-pci, linux-kernel, David S. Miller, Jakub Kicinski, netdev,
	Sergey Miroshnichenko



    >On 7/15/21, 11:57 AM, "Bjorn Helgaas" <helgaas@kernel.org> wrote:

    >Since you've gone to that much trouble already, also note
    >http://vger.kernel.org/majordomo-info.html and
    >https://people.kernel.org/tglx/notes-about-netiquette 

My apologies. I was just cc'd on a thread and blindly responded.  
I didn't realize my mistake until receiving bounce messages for the html formatted message.

    >BTW, the attribution in the email you quoted below got corrupted in
    >such a way that it appears that I wrote the whole thing, instead of 
    >what actually happened, which is that I wrote a half dozen lines of
    >response to your email.  Linux uses old skool email conventions ;)

I will pay closer attention next time.  Again, my apologies.
Outlook really is a bad client for these types of emails!


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
  2021-07-15 19:49       ` Billie Alsup (balsup)
@ 2021-07-15 19:56         ` Bjorn Helgaas
  0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 19:56 UTC (permalink / raw)
  To: Billie Alsup (balsup)
  Cc: Paul Menzel, Bjorn Helgaas, Guohan Lu,
	Madhava Reddy Siddareddygari (msiddare),
	linux-pci, linux-kernel, David S. Miller, Jakub Kicinski, netdev,
	Sergey Miroshnichenko

On Thu, Jul 15, 2021 at 07:49:56PM +0000, Billie Alsup (balsup) wrote:
>     >On 7/15/21, 11:57 AM, "Bjorn Helgaas" <helgaas@kernel.org> wrote:
> 
>     >Since you've gone to that much trouble already, also note
>     >http://vger.kernel.org/majordomo-info.html and
>     >https://people.kernel.org/tglx/notes-about-netiquette 
> 
> My apologies. I was just cc'd on a thread and blindly responded.  
> I didn't realize my mistake until receiving bounce messages for the html formatted message.
> 
>     >BTW, the attribution in the email you quoted below got corrupted in
>     >such a way that it appears that I wrote the whole thing, instead of 
>     >what actually happened, which is that I wrote a half dozen lines of
>     >response to your email.  Linux uses old skool email conventions ;)
> 
> I will pay closer attention next time.  Again, my apologies.
> Outlook really is a bad client for these types of emails!

Can't speak to Outlook, but so is Gmail :(  Thanks for starting the
conversation!

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
  2021-07-15 13:50 ` Paul Menzel
@ 2021-07-15 14:07   ` Bjorn Helgaas
  0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2021-07-15 14:07 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Bjorn Helgaas, Guohan Lu, Billie Alsup,
	Madhava Reddy Siddareddygari, linux-pci, linux-kernel,
	David S. Miller, Jakub Kicinski, netdev, Sergey Miroshnichenko

[+cc Sergey]

On Thu, Jul 15, 2021 at 03:50:49PM +0200, Paul Menzel wrote:
> [Add Billie’s correct address.]
> 
> Am 13.07.21 um 09:31 schrieb Paul Menzel:
> > From: balsup <balsup@contoso.com>
> 
> Billie Alsup <balsup@cisco.com>
> 
> > Data path devices are powered off by default, they will not be visible at
> > BIOS stage and memory for these devices is not reserved.
> > 
> > By default, no address space would be reserved on the bridges for these
> > unpowered devices. When they were powered up, they could fail to initialize
> > because there was no appropriately aligned window available for a given
> > BAR.
> > 
> > This patch will reserve address space for data path devices that are behind
> > PCIe bridge, so that when devices are available PCIe subsystem will be
> > assign the address within the specified range.
> > 
> > Signed-off-by: Madhava Reddy Siddareddygari <msiddare@cisco.com>
> > ---
> > This patch was submitted to the SONiC project for a Cisco device [1].
> > It’s better to have it reviewed and committed upstream though.
> > 
> >   drivers/pci/setup-bus.c | 159 ++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 159 insertions(+)
> > 
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index 2ce636937c6e..266097984e19 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -967,6 +967,148 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
> >   	return min_align;
> >   }
> > +#define PLX_RES_MAGIC_VALUE            0xABBA
> > +#define PLX_RES_DS_PORT_REG0           0xC6C
> > +#define PLX_RES_DS_PORT_REG1           0xC70
> > +#define PLX_RES_MAGIC_OFFSET           0xC76
> > +#define PLX_RES_NP_MASK                0x1
> > +#define PLX_RES_P_MASK                 0x1F
> > +
> > +static struct pci_dev *
> > +plx_find_nt_device(struct pci_bus *bus, unsigned short brg_dev_id)
> > +{
> > +	struct pci_dev *dev, *nt_virt_dev = NULL;
> > +	struct pci_bus *child_bus;
> > +	unsigned short vendor, devid, class;
> > +
> > +	if (!bus)
> > +		return NULL;
> > +
> > +	list_for_each_entry(child_bus, &bus->children, node) {
> > +		list_for_each_entry(dev, &child_bus->devices, bus_list) {
> > +			vendor = dev->vendor;
> > +			devid = dev->device;
> > +			class = dev->class >> 8;
> > +
> > +			if ((vendor == PCI_VENDOR_ID_PLX) &&
> > +					(brg_dev_id == devid) &&
> > +					(class == PCI_CLASS_BRIDGE_OTHER)) {
> > +				dev_dbg(&dev->dev, "Found NT device 0x%x\n",
> > +						devid);
> > +				nt_virt_dev = dev;
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (nt_virt_dev)
> > +			break;
> > +	}
> > +	return nt_virt_dev;
> > +}
> > +
> > +static resource_size_t
> > +pci_get_plx_downstream_res_size(struct pci_bus *bus, unsigned long res_type)
> > +{
> > +	int depth = 0;
> > +	resource_size_t size = 0;
> > +	struct pci_dev *dev = bus->self;
> > +	struct pci_bus *tmp_bus;
> > +	struct pci_dev *nt_virt_dev;
> > +	u16 res_magic = 0;
> > +
> > +	/*
> > +	 * 32 bits to store the memory requirement for PLX ports.
> > +	 * Following is the layout:
> > +	 * np32_0:1;  --> non-prefetchable port 0
> > +	 * p64_0:5;   --> prefetchable port 0
> > +	 * np32_1:1;  --> non-prefetchable port 1
> > +	 * p64_1:5;   --> prefetchable port 1
> > +	 * np32_2:1;  --> non-prefetchable port 2
> > +	 * p64_2:5;   --> prefetchable port 2
> > +	 * np32_3:1;  --> non-prefetchable port 3
> > +	 * p64_3:5;   --> prefetchable port 3
> > +	 * np32_4:1;  --> non-prefetchable port 4
> > +	 * p64_4:5;   --> prefetchable port 4
> > +	 * reserved:2;
> > +	 */
> > +	unsigned int port_bitmap;
> > +
> > +	u32 mem_res_bitmap = 0;
> > +	unsigned int ds_port_offset = 0;
> > +	unsigned short multiplier = 0;
> > +	unsigned short np_size = 0;
> > +
> > +	/*
> > +	 * PLX8713 used on FC4 and FC8
> > +	 * PLX8725 used on FC12 and FC18
> > +	 */
> > +	if (!dev || dev->vendor != PCI_VENDOR_ID_PLX ||
> > +			((dev->device & 0xFF00) != 0x8700))
> > +		return size;
> > +
> > +	tmp_bus = bus;
> > +	while (tmp_bus->parent) {
> > +		tmp_bus = tmp_bus->parent;
> > +		depth++;
> > +	}
> > +
> > +	/* Only for Second level bridges */
> > +	if (depth != 5)
> > +		return size;
> > +
> > +	nt_virt_dev = plx_find_nt_device(bus->parent, 0x87b0);
> > +	if (nt_virt_dev) {
> > +		pci_read_config_word(nt_virt_dev, PLX_RES_MAGIC_OFFSET,
> > +				&res_magic);
> > +		dev_dbg(&nt_virt_dev->dev,
> > +				"Magic offset of 0x%x found in NT device\n", res_magic);
> > +	}
> > +
> > +	if (res_magic == PLX_RES_MAGIC_VALUE) {
> > +		/*
> > +		 * The pacifics are connected on PLX ports:
> > +		 *  FC4 and FC8: #3, #4
> > +		 *  FC12       : #3, #4, #5
> > +		 *  FC18       : #3, #4, #5, #11
> > +		 */
> > +
> > +		/* Calculate resource based on EEPROM values */
> > +		ds_port_offset = (bus->number - bus->parent->number) - 1;
> > +		if (ds_port_offset < 5) {
> > +			pci_read_config_dword(nt_virt_dev, PLX_RES_DS_PORT_REG0,
> > +					&mem_res_bitmap);
> > +		} else {
> > +			ds_port_offset -= 5;
> > +			pci_read_config_dword(nt_virt_dev, PLX_RES_DS_PORT_REG1,
> > +					&mem_res_bitmap);
> > +		}
> > +		port_bitmap = mem_res_bitmap;
> > +		dev_dbg(&bus->dev, "Port offset: 0x%x, res bitmap 0x%x\n",
> > +				ds_port_offset, mem_res_bitmap);
> > +
> > +		if (ds_port_offset < 5) {
> > +			u8 m[] = { 26, 20, 14, 8, 2 };
> > +			u8 s[] = { 31, 25, 19, 13, 7 };
> > +
> > +			multiplier = (port_bitmap >> m[ds_port_offset]) & PLX_RES_P_MASK;
> > +			np_size = (port_bitmap >> s[ds_port_offset]) & PLX_RES_NP_MASK;
> > +
> > +			dev_dbg(&bus->dev, "Multiplier: %d, np_size: %d\n",
> > +					multiplier, np_size);
> > +
> > +			if (res_type & IORESOURCE_PREFETCH) {
> > +				size = 0x100000 << (multiplier - 1);
> > +				dev_dbg(&bus->dev, "Pref Multiplier %d, Size 0x%llx\n",
> > +						multiplier, (long long) size);
> > +			} else if (np_size) {
> > +				size = 0x100000;
> > +				dev_dbg(&bus->dev, "NP Size 0x%llx\n", (long long) size);
> > +			}
> > +		}
> > +	}
> > +	return size;
> > +}
> > +
> >   /**
> >    * pbus_size_mem() - Size the memory window of a given bus
> >    *
> > @@ -1001,6 +1143,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >   	resource_size_t children_add_size = 0;
> >   	resource_size_t children_add_align = 0;
> >   	resource_size_t add_align = 0;
> > +	unsigned int dev_count = 0;
> >   	if (!b_res)
> >   		return -ENOSPC;
> > @@ -1016,6 +1159,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >   	list_for_each_entry(dev, &bus->devices, bus_list) {
> >   		int i;
> > +		dev_count++;
> >   		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> >   			struct resource *r = &dev->resource[i];
> >   			resource_size_t r_size;
> > @@ -1071,6 +1215,21 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >   		}
> >   	}
> > +	/* Static allocation for FC pacific */
> > +	if (!size && !dev_count) {
> > +		size = pci_get_plx_downstream_res_size(bus, type);
> > +		if (size) {
> > +			order = __ffs(size);
> > +			dev_dbg(&bus->self->dev, "order for %llx is %u\n", (long long) size, order);
> > +			if ((order >= 20) &&
> > +					((order -= 20) < ARRAY_SIZE(aligns)) &&
> > +					(order > max_order)) {
> > +				max_order = order;
> > +				dev_dbg(&bus->self->dev, "max_order reset to %d; size %zx\n", max_order, (size_t)size);
> > +			}
> > +		}
> > +	}
> > +
> >   	min_align = calculate_mem_align(aligns, max_order);
> >   	min_align = max(min_align, window_alignment(bus, b_res->flags));
> >   	size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);
> 
> This is basically a request for comments, how to deal with such
> hardware, which is going to run Linux based network operating
> systems (NOS) like SONiC.

I don't think this patch is practical for upstream as-is because it
inserts so much device-specific stuff in generic code.  That won't
scale when we try to extend it for other similar devices.

I think the long-term solution is something like Sergey's work on
movable BARs [1], but that needs testing and review and hasn't been
merged yet.

In the short term, you might be able to work around this with the
"pci=resource_alignment=" or "pci=hpmmiosize=" kernel parameters.

[1] https://lore.kernel.org/linux-pci/20191024171228.877974-1-s.miroshnichenko@yadro.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges
       [not found] <20210713073124.177027-1-pmenzel@molgen.mpg.de>
@ 2021-07-15 13:50 ` Paul Menzel
  2021-07-15 14:07   ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Menzel @ 2021-07-15 13:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Guohan Lu, Billie Alsup, Madhava Reddy Siddareddygari, linux-pci,
	linux-kernel, David S. Miller, Jakub Kicinski, netdev

[Add Billie’s correct address.]

Am 13.07.21 um 09:31 schrieb Paul Menzel:
> From: balsup <balsup@contoso.com>

Billie Alsup <balsup@cisco.com>

> Data path devices are powered off by default, they will not be visible at
> BIOS stage and memory for these devices is not reserved.
> 
> By default, no address space would be reserved on the bridges for these
> unpowered devices. When they were powered up, they could fail to initialize
> because there was no appropriately aligned window available for a given
> BAR.
> 
> This patch will reserve address space for data path devices that are behind
> PCIe bridge, so that when devices are available PCIe subsystem will be
> assign the address within the specified range.
> 
> Signed-off-by: Madhava Reddy Siddareddygari <msiddare@cisco.com>
> ---
> This patch was submitted to the SONiC project for a Cisco device [1].
> It’s better to have it reviewed and committed upstream though.
> 
>   drivers/pci/setup-bus.c | 159 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 159 insertions(+)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 2ce636937c6e..266097984e19 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -967,6 +967,148 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
>   	return min_align;
>   }
>   
> +#define PLX_RES_MAGIC_VALUE            0xABBA
> +#define PLX_RES_DS_PORT_REG0           0xC6C
> +#define PLX_RES_DS_PORT_REG1           0xC70
> +#define PLX_RES_MAGIC_OFFSET           0xC76
> +#define PLX_RES_NP_MASK                0x1
> +#define PLX_RES_P_MASK                 0x1F
> +
> +static struct pci_dev *
> +plx_find_nt_device(struct pci_bus *bus, unsigned short brg_dev_id)
> +{
> +	struct pci_dev *dev, *nt_virt_dev = NULL;
> +	struct pci_bus *child_bus;
> +	unsigned short vendor, devid, class;
> +
> +	if (!bus)
> +		return NULL;
> +
> +	list_for_each_entry(child_bus, &bus->children, node) {
> +		list_for_each_entry(dev, &child_bus->devices, bus_list) {
> +			vendor = dev->vendor;
> +			devid = dev->device;
> +			class = dev->class >> 8;
> +
> +			if ((vendor == PCI_VENDOR_ID_PLX) &&
> +					(brg_dev_id == devid) &&
> +					(class == PCI_CLASS_BRIDGE_OTHER)) {
> +				dev_dbg(&dev->dev, "Found NT device 0x%x\n",
> +						devid);
> +				nt_virt_dev = dev;
> +				break;
> +			}
> +		}
> +
> +		if (nt_virt_dev)
> +			break;
> +	}
> +	return nt_virt_dev;
> +}
> +
> +static resource_size_t
> +pci_get_plx_downstream_res_size(struct pci_bus *bus, unsigned long res_type)
> +{
> +	int depth = 0;
> +	resource_size_t size = 0;
> +	struct pci_dev *dev = bus->self;
> +	struct pci_bus *tmp_bus;
> +	struct pci_dev *nt_virt_dev;
> +	u16 res_magic = 0;
> +
> +	/*
> +	 * 32 bits to store the memory requirement for PLX ports.
> +	 * Following is the layout:
> +	 * np32_0:1;  --> non-prefetchable port 0
> +	 * p64_0:5;   --> prefetchable port 0
> +	 * np32_1:1;  --> non-prefetchable port 1
> +	 * p64_1:5;   --> prefetchable port 1
> +	 * np32_2:1;  --> non-prefetchable port 2
> +	 * p64_2:5;   --> prefetchable port 2
> +	 * np32_3:1;  --> non-prefetchable port 3
> +	 * p64_3:5;   --> prefetchable port 3
> +	 * np32_4:1;  --> non-prefetchable port 4
> +	 * p64_4:5;   --> prefetchable port 4
> +	 * reserved:2;
> +	 */
> +	unsigned int port_bitmap;
> +
> +	u32 mem_res_bitmap = 0;
> +	unsigned int ds_port_offset = 0;
> +	unsigned short multiplier = 0;
> +	unsigned short np_size = 0;
> +
> +	/*
> +	 * PLX8713 used on FC4 and FC8
> +	 * PLX8725 used on FC12 and FC18
> +	 */
> +	if (!dev || dev->vendor != PCI_VENDOR_ID_PLX ||
> +			((dev->device & 0xFF00) != 0x8700))
> +		return size;
> +
> +	tmp_bus = bus;
> +	while (tmp_bus->parent) {
> +		tmp_bus = tmp_bus->parent;
> +		depth++;
> +	}
> +
> +	/* Only for Second level bridges */
> +	if (depth != 5)
> +		return size;
> +
> +	nt_virt_dev = plx_find_nt_device(bus->parent, 0x87b0);
> +	if (nt_virt_dev) {
> +		pci_read_config_word(nt_virt_dev, PLX_RES_MAGIC_OFFSET,
> +				&res_magic);
> +		dev_dbg(&nt_virt_dev->dev,
> +				"Magic offset of 0x%x found in NT device\n", res_magic);
> +	}
> +
> +	if (res_magic == PLX_RES_MAGIC_VALUE) {
> +		/*
> +		 * The pacifics are connected on PLX ports:
> +		 *  FC4 and FC8: #3, #4
> +		 *  FC12       : #3, #4, #5
> +		 *  FC18       : #3, #4, #5, #11
> +		 */
> +
> +		/* Calculate resource based on EEPROM values */
> +		ds_port_offset = (bus->number - bus->parent->number) - 1;
> +		if (ds_port_offset < 5) {
> +			pci_read_config_dword(nt_virt_dev, PLX_RES_DS_PORT_REG0,
> +					&mem_res_bitmap);
> +		} else {
> +			ds_port_offset -= 5;
> +			pci_read_config_dword(nt_virt_dev, PLX_RES_DS_PORT_REG1,
> +					&mem_res_bitmap);
> +		}
> +		port_bitmap = mem_res_bitmap;
> +		dev_dbg(&bus->dev, "Port offset: 0x%x, res bitmap 0x%x\n",
> +				ds_port_offset, mem_res_bitmap);
> +
> +		if (ds_port_offset < 5) {
> +			u8 m[] = { 26, 20, 14, 8, 2 };
> +			u8 s[] = { 31, 25, 19, 13, 7 };
> +
> +			multiplier = (port_bitmap >> m[ds_port_offset]) & PLX_RES_P_MASK;
> +			np_size = (port_bitmap >> s[ds_port_offset]) & PLX_RES_NP_MASK;
> +
> +			dev_dbg(&bus->dev, "Multiplier: %d, np_size: %d\n",
> +					multiplier, np_size);
> +
> +			if (res_type & IORESOURCE_PREFETCH) {
> +				size = 0x100000 << (multiplier - 1);
> +				dev_dbg(&bus->dev, "Pref Multiplier %d, Size 0x%llx\n",
> +						multiplier, (long long) size);
> +			} else if (np_size) {
> +				size = 0x100000;
> +				dev_dbg(&bus->dev, "NP Size 0x%llx\n", (long long) size);
> +			}
> +		}
> +	}
> +	return size;
> +}
> +
>   /**
>    * pbus_size_mem() - Size the memory window of a given bus
>    *
> @@ -1001,6 +1143,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>   	resource_size_t children_add_size = 0;
>   	resource_size_t children_add_align = 0;
>   	resource_size_t add_align = 0;
> +	unsigned int dev_count = 0;
>   
>   	if (!b_res)
>   		return -ENOSPC;
> @@ -1016,6 +1159,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>   	list_for_each_entry(dev, &bus->devices, bus_list) {
>   		int i;
>   
> +		dev_count++;
>   		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
>   			struct resource *r = &dev->resource[i];
>   			resource_size_t r_size;
> @@ -1071,6 +1215,21 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>   		}
>   	}
>   
> +	/* Static allocation for FC pacific */
> +	if (!size && !dev_count) {
> +		size = pci_get_plx_downstream_res_size(bus, type);
> +		if (size) {
> +			order = __ffs(size);
> +			dev_dbg(&bus->self->dev, "order for %llx is %u\n", (long long) size, order);
> +			if ((order >= 20) &&
> +					((order -= 20) < ARRAY_SIZE(aligns)) &&
> +					(order > max_order)) {
> +				max_order = order;
> +				dev_dbg(&bus->self->dev, "max_order reset to %d; size %zx\n", max_order, (size_t)size);
> +			}
> +		}
> +	}
> +
>   	min_align = calculate_mem_align(aligns, max_order);
>   	min_align = max(min_align, window_alignment(bus, b_res->flags));
>   	size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);

This is basically a request for comments, how to deal with such 
hardware, which is going to run Linux based network operating systems 
(NOS) like SONiC.

Would hotplugging be an option, and could you give pointers?


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-07-15 19:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <BYAPR11MB3527AEB1E4969C0833D1697ED9129@BYAPR11MB3527.namprd11.prod.outlook.com>
2021-07-15 17:13 ` [RFC][PATCH] PCI: Reserve address space for powered-off devices behind PCIe bridges Bjorn Helgaas
2021-07-15 18:12   ` Billie Alsup (balsup)
2021-07-15 18:56     ` Bjorn Helgaas
2021-07-15 19:49       ` Billie Alsup (balsup)
2021-07-15 19:56         ` Bjorn Helgaas
     [not found] <20210713073124.177027-1-pmenzel@molgen.mpg.de>
2021-07-15 13:50 ` Paul Menzel
2021-07-15 14:07   ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).