LKML Archive on lore.kernel.org
* Re: __get_free_pages(): is the MEM really mine?
@ 2001-09-27 10:06 Bernd Harries
  2001-09-27 13:00 ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Bernd Harries @ 2001-09-27 10:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo

Thanks, Ingo.

> perfectly legal - but there is no guarantee you will succeed getting two
> nearby 2 MB pages. You will get it if your driver initializes during

Yes, I try until I have enough contiguous chunks (2 in 2.4.x, 32 in 2.2.x) and
free the rest again. But statistically I quite often get 2 contiguous chunks
immediately if X11 is not already running. The box has 256 MB.

> > When I run the user appl. again after short time I mostly get the same
> > chunk of physical memory (virt_to_bus is identical!)
> 
> have you perhaps freed that page?

Yes, as expected.

> printk every occasion of
> allocating/freeing a 2 MB buffer and i'm sure you'll see the problem.
> (Perhaps it's the close() implicitly done by exit() that frees the
> buffer?)

Yes, that is definitely the case and I expect it. 

> >  Now close '/dev/aprsc027' fd = 3 ...

The test program does a close and on the console 

But from getting the same physical address again after some time I tend to
conclude that no one else uses much memory in between. Plus, the first page of the
area stays zero all the time while the higher pages seem to be used by
someone. I know that this is no proof that the 1st page was really not used
otherwise, but... And I know it is also legal for other processes to use the very same RAM.
But the problem is that the system gets unstable. And it doesn't get unstable if
the order is 0, which I use in minors 0..23 to allocate a small kernel
buffer. Only minors 26 and 27 allocate a 4 MB contiguous buffer in open() and mmap
that buffer to user space, while minors 28 and 29 only allocate a small buffer to
write the FIFOs and mmap the 32 MB PCI area of the card.

The impression I have is that only large allocations behave strangely. But
the instability is not visible immediately. Too bad. Only after some time do I
see strange behaviour of the system. But I think I don't see them if I only
use the functionality of the minors with smaller buffers.

Could my nopage() method be implemented wrongly?
I learned how to implement it from Alessandro Rubini's book:

  chn_ptr = (struct RSC_SOFTC *)vma_ptr->vm_private_data;
  card_ptr = chn_ptr->card_ptr;
  minor = chn_ptr->minor;
  card_chn = minor & APRSC_CARD_CHNS_MASK;

  page_ptr = NOPAGE_SIGBUS;
  
  Iprintf(" address=$%08lX ad - vm_start=$%08lX VMA_OFFSET=$%08lX \n",
    address,
    address - vma_ptr->vm_start,
    vma_ptr->vm_pgoff << PAGE_SHIFT);
  
  offset = address - vma_ptr->vm_start + (vma_ptr->vm_pgoff << PAGE_SHIFT);

  if(card_chn == APRSC_DEV_PER_CARD - 6)  /* image 1, ch 26 of this card */
  {
    if(offset >= card_ptr->contig_len0)  /* offset == len is already past the end */
    {
      return(page_ptr);
    }
    /*endif()*/
    start = (ULONG)card_ptr->dma_mem0;
  }
  else if(card_chn == APRSC_DEV_PER_CARD - 5)  /* image 2, ch 27 of this card */
  {
    if(offset >= card_ptr->contig_len1)  /* offset == len is already past the end */
    {
      return(page_ptr);
    }
    /*endif()*/
    start = (ULONG)card_ptr->dma_mem1;
  }
  else
  {
    return(page_ptr);
  }
  /*endif(card_chn == APRSC_DEV_PER_CARD - [(>=7), (<=4)] etc.)*/
  page_ptr = virt_to_page(start + offset);
  Iprintf(" start+off=$%08lX page_ptr=$%8p \n",
    start + offset,
    page_ptr);
  get_page(page_ptr);
  
  return(page_ptr);
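
The offset arithmetic in the handler above can be checked outside the kernel. What follows is a minimal user-space sketch, not the driver's code: it assumes 4 KiB pages (a PAGE_SHIFT of 12, as on i386), and fault_offset() and offset_in_buffer() are hypothetical helper names. An offset equal to the buffer length already lies past the last byte, so the bounds test treats it as out of range.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12UL  /* assumption: 4 KiB pages, as on i386 */

/* Byte offset into the driver buffer for a faulting user address. */
unsigned long fault_offset(unsigned long address,
                           unsigned long vm_start,
                           unsigned long vm_pgoff)
{
    assert(address >= vm_start);  /* a fault always lies inside the VMA */
    return address - vm_start + (vm_pgoff << PAGE_SHIFT);
}

/* True if 'offset' addresses a byte inside a buffer of 'len' bytes. */
bool offset_in_buffer(unsigned long offset, unsigned long len)
{
    return offset < len;
}
```

With the values from the console log below (address $40134000, mapping start $40132000, vm_pgoff 0) fault_offset() yields $2000, which agrees with the logged start+off of $CE602000 for a buffer at $CE600000.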



Here is the console output of my driver during the test program:

Sep 27 11:43:28 pcma73 kernel: rsc_open() minor=$1B 
Sep 27 11:43:28 pcma73 kernel:  DMA blk 0 at KV:$CE800000 BUS:$0E800000 
Sep 27 11:43:28 pcma73 kernel:  DMA blk 1 at KV:$CE600000 BUS:$0E600000 contig <
Sep 27 11:43:28 pcma73 kernel:  Max Buffer Frag at BUS:$0E600000 len $00400000 bytes
Sep 27 11:43:28 pcma73 kernel:  Collected DMA Buffer1 at KS:$0000CE600000 BUS:$0E600000 len $00400000 bytes
Sep 27 11:43:28 pcma73 kernel: rsc_ioctl()
Sep 27 11:43:28 pcma73 kernel:  RSC_IOC_GET_FIX: copy_to_user() returned $0 
Sep 27 11:43:28 pcma73 kernel: rsc_ioctl()
Sep 27 11:43:28 pcma73 kernel: rsc_mmap()  minor=$1B  offset=$00000000 
Sep 27 11:43:28 pcma73 kernel: rsc_vma_open()
Sep 27 11:43:28 pcma73 kernel: rsc_nopage()
Sep 27 11:43:28 pcma73 kernel:  address=$40132000 ad - vm_start=$00000000 VMA_OFFSET=$00000000
Sep 27 11:43:28 pcma73 kernel:  start+off=$CE600000 page_ptr=$c1398000 
Sep 27 11:43:28 pcma73 kernel: rsc_nopage()
Sep 27 11:43:28 pcma73 kernel:  address=$40134000 ad - vm_start=$00002000 VMA_OFFSET=$00000000
Sep 27 11:43:28 pcma73 kernel:  start+off=$CE602000 page_ptr=$c1398080 
Sep 27 11:43:28 pcma73 kernel: rsc_ioctl()
Sep 27 11:43:28 pcma73 kernel:  RSC_IOC_DMA_OUT
Sep 27 11:43:28 pcma73 kernel: rsc_vma_close()
Sep 27 11:43:28 pcma73 kernel: rsc_close()
Sep 27 11:43:28 pcma73 kernel:  PCIRSC: DMA0CSR=$10 ok.
Sep 27 11:43:28 pcma73 kernel:  PCIRSC: DMA1CSR=$10 ok.
Sep 27 11:43:28 pcma73 kernel:  PCIRSC: PCISR=$0290 ok.
Sep 27 11:43:28 pcma73 kernel:  Free DMA blk 0 at KS:$CE800000 
Sep 27 11:43:28 pcma73 kernel:  Free DMA blk 1 at KS:$CE600000 


Thank you very much for your help!

-- 
Bernd Harries

bha@gmx.de            http://bharries.freeyellow.com
bharries@web.de       Tel. +49 421 809 7343 priv.  | MSB First!
harries@stn-atlas.de       +49 421 457 3966 offi.  | Linux-m68k
bernd@linux-m68k.org       +49 172 139 6054 handy  | Medusa T40


* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27 10:06 __get_free_pages(): is the MEM really mine? Bernd Harries
@ 2001-09-27 13:00 ` Ingo Molnar
  2001-09-29 17:15   ` Bernd Harries
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2001-09-27 13:00 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel


On Thu, 27 Sep 2001, Bernd Harries wrote:

> > have you perhaps freed that page?
>
> Yes, as expected.

well - what did you expect to happen? A freed page is going to be reused
for other purposes. A big 2MB allocation can be reused in part, once
memory usage grows. So you should not expect the device to be able to DMA
into a page that got freed, unpunished. Perhaps i'm misunderstanding the
problem.

> But I tend to conclude from getting the same phys address again after
> some time that no one else uses much memory in between. Plus, the first
> page of the area stays Zero all the time while the higher pages seem
> to be used by someone. [...]

the buddy allocator allocates top down. Plus, if you allocate a 2MB
physically contiguous chunk then the likelihood is high that there were
fragmented pages skipped during the initial search for a 2MB block - so
you still have a fair likelihood to reallocate it after some time, if
memory usage is light. But this likelihood nears zero once RAM usage gets
near 100%.

	Ingo



* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27 13:00 ` Ingo Molnar
@ 2001-09-29 17:15   ` Bernd Harries
  2001-09-30  7:27     ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Bernd Harries @ 2001-09-29 17:15 UTC (permalink / raw)
  To: mingo; +Cc: Bernd Harries, linux-kernel

Greetings from the 2001 Linux Devel meeting in Oldenburg!

Roman Zippel looked at my driver and added code to print the usage
counter for each page after an order-9 __get_free_pages().

We found that only the first (!) page has a count of 1, the others have 0!

That would match my impression that only the 1st page is really mine...

Roman found that strange and added this:

          struct page * page = virt_to_page(card_ptr->dma_blk1[n]);
          int i;
          for(i = 0; i < (1 << max_order); i++, page++)
          {
            atomic_set(&page->count, 1);
          }

And the freeing of the pages is now done page by page in the _vma_close()
function.

I will now test the version but I have only a 1-CPU box here. On an SMP Box I
could imagine that even between __get_free_pages() and the
atomic_set(&page->count, 1) someone else already uses my pages.

Could you please comment on this?

Thanks,
-- 
Bernd Harries

bha@gmx.de           http://www.freeyellow.com/members/bharries
bha@nikocity.de       Tel. +49 421 809 7343 priv.  | MSB First!
harries@stn-atlas.de       +49 421 457 3966 offi.  | Linux-m68k
bernd@linux-m68k.org      8.48'21" E  52.48'52" N  | Medusa T40


* Re: __get_free_pages(): is the MEM really mine?
  2001-09-29 17:15   ` Bernd Harries
@ 2001-09-30  7:27     ` Ingo Molnar
  2001-09-30 12:59       ` Bernd Harries
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2001-09-30  7:27 UTC (permalink / raw)
  To: Bernd Harries; +Cc: Bernd Harries, linux-kernel


On Sat, 29 Sep 2001, Bernd Harries wrote:

> Roman Zippel looked at my driver and added code to print the usage
> counter for each page after a 9-order __get_free_pages().
>
> We found that only the first (!) page has a count of 1, the others
> have 0!

This is a property of Linux's buddy allocator. If you allocate a 9th order
'big page', that does not mean you can free the pages one by one. Higher
order pages are 'one unit' and are typically handled as such. Eg. the
kernel stack is allocated and freed as order 1 pages.

>           struct page * page = virt_to_page(card_ptr->dma_blk1[n]);
>           int i;
>           for(i = 0; i < (1 << max_order); i++, page++)
>           {
>             atomic_set(&page->count, 1);
>           }
>
> And the freeing of the pages is now done page by page in the _vma_close()
> function.

while unconventional, doing this is safe. There is nothing in the page
structure that says that the page was allocated as a higher order page. So
if you fix up the page counts, freeing them as separate entities is safe.
(in fact it's even safe to split it up into 8k or 16k pages - not that
this would be useful for you.) But the above is an 'internal' property of
the Linux page allocator, so it's not guaranteed to stay so forever.

(the Linux kernel does not do the above for understandable reasons: it
takes a loop of 512 iterations to fix up the page counts in the above way,
which is noticeable runtime overhead.)

is it a fundamental property of the hardware that it needs a contiguous
physical memory buffer? If not then i'd strongly suggest updating the
driver to do scatter-gather instead of trying to allocate a 2 MB page.
Being able to allocate a 2 MB page is only guaranteed during bootup. There
is just no mechanism in Linux that guarantees you will be able to
allocate a 2 MB page (let alone two adjacent 2 MB pages), in even a
moderately utilized system. Scatter-gather avoids all these problems.
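
To illustrate what scatter-gather means here: instead of one contiguous buffer, the driver hands the device a list of (address, length) segments. A minimal user-space sketch follows, assuming 4 KiB pages. struct sg_entry and build_sg_list() are hypothetical names chosen for this example; they are not the kernel's scatterlist API and not the descriptor format of any particular DMA engine.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL  /* assumption: 4 KiB pages */

struct sg_entry {
    unsigned long bus_addr;  /* start of a physically contiguous run */
    unsigned long len;       /* length of the run in bytes */
};

/*
 * Turn a list of page bus addresses (in buffer order) into DMA segments,
 * merging neighbours that happen to be physically adjacent.
 * Returns the number of segments written. Returns 0 if 'max' segments
 * are not enough (and also for an empty page list).
 */
size_t build_sg_list(const unsigned long *pages, size_t npages,
                     struct sg_entry *sg, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < npages; i++) {
        if (n > 0 && sg[n - 1].bus_addr + sg[n - 1].len == pages[i]) {
            sg[n - 1].len += PAGE_SIZE;  /* extend the current run */
            continue;
        }
        if (n == max)
            return 0;                    /* out of descriptors */
        sg[n].bus_addr = pages[i];
        sg[n].len = PAGE_SIZE;
        n++;
    }
    return n;
}
```

The merge step matters only as an optimisation: a buffer built from unrelated order-0 pages is still usable, it just needs one segment per page.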

	Ingo



* Re: __get_free_pages(): is the MEM really mine?
  2001-09-30  7:27     ` Ingo Molnar
@ 2001-09-30 12:59       ` Bernd Harries
  2001-10-01  5:55         ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Bernd Harries @ 2001-09-30 12:59 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Ingo Molnar wrote:

> This is a property of Linux's buddy allocator. If you allocate a 9th order
> 'big page', that does not mean you can free the pages one by one.

I used  free_pages((ULONG)card_ptr->dma_blk0[n], max_order); before Roman 
changed it. And still do for minor 26 now...

> while unconventional, doing this is safe. There is nothing in the page
> structure that says that the page was allocated as a higher order page.

> But the above is an 'internal' property of
> the Linux page allocator, so it's not guaranteed to stay so forever.

That's why I don't like it so much. But it seems I must do it for some strange
reason:

On minor 26 I do it the old way, on minor 27 I use Roman's fix. What can I say:
reading and writing to the buffer allocated with Roman's fix has so far never
crashed the system. But doing it the normal way (minor 26), as I also learned it
from A. Rubini's book, does harm to the system.

After usage of the normally allocated buffer the strangest things occur:
- Issuing w caused a dump on the console once.
- Halt doesn't really halt the system completely
- Reboot caused everything to hang, partitions still dirty...


> (the Linux kernel does not do the above for understandable reasons: it
> takes a loop of 512 iterations to fix up the page counts in the above way,
> which is noticeable runtime overhead.)

Oh yes, indeed! But:

Is there a guarantee that the n - 1 pages above the 1st one are not donated to
other programs while my driver uses them?


> is it a fundamental property of the hardware that it needs a contiguous
> physical memory buffer?

Yes. The FW on the card demands it.

> Being able to allocate a 2 MB page is only guaranteed during bootup. There
> is just no mechanism in Linux that guarantees you will be able to
> allocate a 2 MB page (let alone two adjacent 2 MB pages), in even a
> moderately utilized system. Scatter-gather avoids all these problems.

I'll move the code to init_module later once it is stable.

Ciao,
-- 
Bernd Harries


* Re: __get_free_pages(): is the MEM really mine?
  2001-09-30 12:59       ` Bernd Harries
@ 2001-10-01  5:55         ` Ingo Molnar
  2001-10-05  8:49           ` Bernd Harries
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2001-10-01  5:55 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel


On Sun, 30 Sep 2001, Bernd Harries wrote:

> Is there a guarantee that the n - 1 pages above the 1st one are not
> donated to other programs while my driver uses them?

yes. The 2MB block of 512 x 4k pages (we should perhaps call it a 'order 9
page') is yours.

> > is it a fundamental property of the hardware that it needs a contiguous
> > physical memory buffer?
>
> Yes. The FW on the card demands it.

ok. then i'd suggest doing all this allocation at boot-time, and never
deallocating it. This is the safest method. Unless there is a point in having
the driver as a module (for other than development purposes).

> I'll move the code to init_module later once it is stable.

even init_module() can be executed much later: eg. kmod removes the module
because it's unused, and it's reinserted later. So generally it's really
unrobust to expect a 9th order allocation to succeed at module_init()
time.

the fundamental issue is not the laziness of Linux VM developers. 99.9% of
all allocations are order 0. 99.9% of the remaining allocations are order
1 or 2. It takes a fair amount of overhead and complexity to handle
high-order allocations 'well' - it takes even more effort (and a perverse
limitation on the use of pointers) to guarantee the success of such
allocations all the time.

there is a longer-term and robust solution that could be used though. We
could support a generic 'physical memory pool', that gets allocated on
bootup (via eg. a physmem=10m kernel boot option), and never gets used for
other than such critical allocations. Your driver could call eg.
alloc_physmem(size) and free_physmem(). It would work similarly to
bootmem.c. This 'physical memory pool' would never be used by generic
subsystems - only drivers which support hardware with such limitations are
allowed to use it. The advantage of this approach is that there would be
one generic way to put physically contiguous RAM aside for such drivers -
so the driver would not have to worry about the VM situation. The other
advantage is that we could decrease MAX_ORDER significantly (to around 7)
- support for higher orders increases the runtime overhead of the buddy
allocator, even for low-order allocations.

(later on we could even add support to grow and shrink the size of the
physical memory pool (within certain boundaries), so it could be sized
boot-time.)

would anything like this be useful? Since it's a completely separate pool
(in fact it won't even show up in the normal memory statistics), it does
not disturb the existing VM in any way.
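
A rough user-space sketch of how such a pool could behave, assuming 4 KiB pages and a 10 MB pool. sim_alloc_physmem() and sim_free_physmem() are hypothetical stand-ins for the proposed alloc_physmem()/free_physmem(), which exist only as an idea in this message. The per-page flag array is likewise an assumption, loosely modelled on a bootmem-style bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_PAGES 2560  /* 10 MB of 4 KiB pages, cf. physmem=10m */

static bool pool_used[POOL_PAGES];  /* one flag per page in the pool */

/*
 * First-fit search for 'npages' contiguous free pages.
 * Returns the index of the first page, or -1 if no run is large enough.
 */
long sim_alloc_physmem(size_t npages)
{
    size_t run = 0;

    if (npages == 0 || npages > POOL_PAGES)
        return -1;
    for (size_t i = 0; i < POOL_PAGES; i++) {
        run = pool_used[i] ? 0 : run + 1;
        if (run == npages) {
            size_t first = i + 1 - npages;
            for (size_t j = first; j <= i; j++)
                pool_used[j] = true;  /* mark the run as taken */
            return (long)first;
        }
    }
    return -1;
}

/* Give back a run previously handed out by sim_alloc_physmem(). */
void sim_free_physmem(long first, size_t npages)
{
    assert(first >= 0 && (size_t)first + npages <= POOL_PAGES);
    for (size_t j = 0; j < npages; j++)
        pool_used[(size_t)first + j] = false;
}
```

Because nothing else ever allocates from the pool, a request can only fail when the drivers themselves have used it up or fragmented it, which is the property the proposal is after.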

	Ingo



* Re: __get_free_pages(): is the MEM really mine?
  2001-10-01  5:55         ` Ingo Molnar
@ 2001-10-05  8:49           ` Bernd Harries
  0 siblings, 0 replies; 17+ messages in thread
From: Bernd Harries @ 2001-10-05  8:49 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Hi Ingo,

The problem with mmapping a kernel buffer to userspace is still there.

It appears that __get_free_pages(GFP_KERNEL, max_order) alone is not enough to request a reliable buffer. On Monday I already sent a message to
the list which you may have overlooked.

In my driver I now have the normal method on minor 26 and Roman Zippel's method on minor 27. I have already used minor 27 quite heavily and it
appears stable. Using minor 26 makes the system unstable almost immediately.

I would like you to try my driver, either on my system via remote login, or I could try to reproduce the effect without DMA accesses to the buffer
and modify the driver so that you can try it without the hardware in your computer.

Is one of these 2 ways possible for you?

Thanks,
-- 
Bernd Harries


* Re: __get_free_pages(): is the MEM really mine?
  2001-10-05 13:32   ` Bernd Harries
@ 2001-10-05 15:27     ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2001-10-05 15:27 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel, mingo

On Fri, 5 Oct 2001, Bernd Harries wrote:
> Hugh Dickins wrote:
> 
> > I don't
> > know whether you're following the mmap-makes-all-pages-present
> > model (using remap_page_range), or the fault-page-by-page model
> > (supplying your own nopage function). 
> 
> The nopage method. In Alessandro Rubini's book (p.391) I read that
> I can't use remap_page_range() on pages obtained by get_free_page().

I just looked that up.  Rubini is right that remap_page_range only
works as you'd want on reserved pages, and pages which fail the
VALID_PAGE(page) test (I'm trying to avoid saying "invalid pages"),
and there is a good reason for that.  But Rubini omits to mention
mem_map_reserve, which can be used (on pages you own exclusively)
to mark a page as temporarily reserved, so remap_page_range will
then work as you'd want on it (with mem_map_unreserve to undo later).

The mem_map_reserve, remap_page_range model is commoner in drivers
than the nopage model; but it is somewhat deprecated now, Linus for
one certainly preferring the nopage model; and the VM_RESERVED vma
flag can give pages that immunity from swap_out which mem_map_reserve
also confers.  You're not wrong to follow the nopage model.

> Hmm, the only thing that happens to them after munmap() is 
> free_pages(). I don't access the pages anymore. But maybe some code in free_pages does? Or decrements count to -1?

I've forgotten by now what your precise symptoms were.  But either
pages would be freed twice and allocated twice; or they would hit a
BUG() statement in second free or second allocation; neither good.

> > Either you should force page count 1 on each of the 
> > order-0-pages before you mmap them in (and raise count to 2);
> 
> by get_page(), right?

Fine; and I expect you'll need to undo it later by appropriate put_page()s.

> Ok, thanks a lot. So it's definitely insufficient how my minor 26 version handles the pages, right? If so, that's a statement I can live with.
> 
> And it was never meant that I could simply mmap the upper pages to userspace directly, without 'touching' each page, was it?

Probably all the drivers which use higher order allocations are using
the older, mem_map_reserve + remap_page_range method; the reserved
bit preserves a page against freeing whatever its page count.  Maybe
you're the first to use the nopage method on a higher order allocation
(or maybe not, and there are already drivers working around it).

I wouldn't claim the way it is currently is ideal design: I think
you've hit a not entirely satisfactory but easily worked around oddity,

Hugh



* Re: __get_free_pages(): is the MEM really mine?
  2001-10-05 12:55 ` Hugh Dickins
@ 2001-10-05 13:32   ` Bernd Harries
  2001-10-05 15:27     ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Bernd Harries @ 2001-10-05 13:32 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, mingo

Hugh Dickins wrote:


> I don't
> know whether you're following the mmap-makes-all-pages-present
> model (using remap_page_range), or the fault-page-by-page model
> (supplying your own nopage function). 

The nopage method. In Alessandro Rubini's book (p.391) I read that I can't use remap_page_range() on pages obtained by get_free_page().

> But either way it sounds like
> you bump each page count by 1 when you map it in, and then when
> it's unmapped the count goes down to 0 on all the later
> order-0-pages,

exactly that happens in the version I use on minor 26 today.

> so they get freed before you're done with them.

Hmm, the only thing that happens to them after munmap() is 
free_pages(). I don't access the pages anymore. But maybe some code in free_pages does? Or decrements count to -1?

> Either you should force page count 1 on each of the 
> order-0-pages before you mmap them in 

Yes, I do that in the version used in minor 27 today right after the allocation.

> (and raise count to 2);

by get_page(), right?

> or you should set
> the Reserved bit on each of them, and clear it before freeing
> (see use of mem_map_reserve and mem_map_unreserve in various
> drivers/sound sources using remap_page_range; there's also a couple of
> examples of the nopage method down there too).

Ok, thanks a lot. So it's definitely insufficient how my minor 26 version handles the pages, right? If so, that's a statement I can live with.

And it was never meant that I could simply mmap the upper pages to userspace directly, without 'touching' each page, was it?

Ciao,
-- 
Bernd Harries


* Re: __get_free_pages(): is the MEM really mine?
  2001-10-01 11:33 Bernd Harries
@ 2001-10-05 12:55 ` Hugh Dickins
  2001-10-05 13:32   ` Bernd Harries
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2001-10-05 12:55 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel, mingo

On Mon, 1 Oct 2001, Bernd Harries wrote:
> 
> I wonder why only I see problems so far. Maybe it's because I also mmap()
> that RAM to user space?

Probably.

munmap() will handle each order-0-page of your order-9
allocation separately.  __get_free_pages gave you count 1 on the
first of those order-0-pages, leaving count 0 on the rest.  I don't
know whether you're following the mmap-makes-all-pages-present
model (using remap_page_range), or the fault-page-by-page model
(supplying your own nopage function).  But either way it sounds like
you bump each page count by 1 when you map it in, and then when it's
unmapped the count goes down to 0 on all the later order-0-pages,
so they get freed before you're done with them.

Either you should force page count 1 on each of the order-0-pages
before you mmap them in (and raise count to 2); or you should set
the Reserved bit on each of them, and clear it before freeing (see use
of mem_map_reserve and mem_map_unreserve in various drivers/sound
sources using remap_page_range; there's also a couple of examples
of the nopage method down there too).
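
The counting problem described above can be replayed in a small user-space model. This is a sketch under stated assumptions, not kernel code: struct sim_page and the sim_*() functions are hypothetical, the allocator is reduced to "first page gets count 1, the rest stay 0" as reported in this thread, and a page counts as released the moment its counter reaches zero on unmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ORDER  9
#define NPAGES (1 << ORDER)  /* 512 order-0 pages in one order-9 block */

struct sim_page {
    int  count;  /* usage counter, standing in for page->count */
    bool freed;  /* set when the counter drops to zero on unmap */
};

/* Mimic what the thread reports: only the first page gets count 1. */
void sim_alloc_block(struct sim_page *pages)
{
    memset(pages, 0, NPAGES * sizeof(*pages));
    pages[0].count = 1;
}

/* The fix-up discussed here: every order-0 page gets its own count of 1. */
void sim_fixup_counts(struct sim_page *pages)
{
    for (int i = 0; i < NPAGES; i++)
        pages[i].count = 1;
}

/* nopage path: get_page() raises the count when the page is mapped. */
static void sim_map(struct sim_page *p)
{
    p->count++;
}

/* munmap path: each order-0 page is dropped separately. */
static void sim_unmap(struct sim_page *p)
{
    if (--p->count == 0)
        p->freed = true;
}

/*
 * Map and unmap every page once. Returns how many pages were released
 * by the unmap alone, i.e. before the driver's own free_pages() call.
 */
int sim_map_unmap_cycle(struct sim_page *pages)
{
    int freed = 0;

    for (int i = 0; i < NPAGES; i++) {
        sim_map(&pages[i]);
        sim_unmap(&pages[i]);
        if (pages[i].freed)
            freed++;
    }
    return freed;
}
```

In this model the unfixed block loses its 511 upper pages during the unmap, while the block with fixed-up counts loses none, so the driver's later free is the only one.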

Hugh



* Re: __get_free_pages(): is the MEM really mine?
@ 2001-10-01 11:33 Bernd Harries
  2001-10-05 12:55 ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Bernd Harries @ 2001-10-01 11:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo

Ingo Molnar wrote:

> > Is there a guarantee that the n - 1 pages above the 1st one are not
> > donated to other programs while my driver uses them?
> 
> yes. The 2MB block of 512 x 4k pages (we should perhaps call it a 'order 9
> page') is yours.

I think I have to demonstrate to you how my driver behaves in reality.

Too bad the driver would at the moment not allow any open() without at least
a PLX RDK Lite evaluation board... It would be possible to modify it to
allow opens even if there is no card. Or to allocate a 4 MB buffer also for the minor
31 device, which is my dummy test minor that needs no HW.

Of course you couldn't use the PLX DMA engine then. But you could still mmap
the RAM to user space.

An alternative to sending you a driver (which could make your box temporarily
unstable) is to let you use my Linux box at home. Damn, why didn't I let
you log in from Oldenburg... I forgot about that possibility. I already took a PLX eval
board home with me on Friday, because here I have the real RSC cards
already.

What do you think?


> > I'll move the code to init_module later once it is stable.
> 
> even init_module() can be executed much later: eg. kmod removes the module
> because it's unused, and it's reinserted later. So generally it's really
> unrobust to expect a 9th order allocation to succeed at module_init()
> time.

For our application (dedicated System) I could guarantee even that.

> the fundamental issue is not the laziness of Linux VM developers. 99.9% of
> all allocations are order 0. 99.9% of the remaining allocations are order
> 1 or 2.

I wonder why only I see problems so far. Maybe it's because I also mmap()
that RAM to user space?



> (later on we could even add support to grow and shrink the size of the
> physical memory pool (within certain boundaries), so it could be sized
> boot-time.)
> 
> would anything like this be useful? Since it's a completely separate pool
> (in fact it wont even show up in the normal memory statistics), it does
> not disturb the existing VM in any way.

It wouldn't even be needed at the moment. The order-9 get_free_pages() does
not explicitly fail, not even during later open()s. If it did, I would
simply add more RAM. (Well, let the company pay for it.) 256 MB are in and that is
enough so far.

Later I will load the module explicitly right after boot and then it's
almost sure I will get the RAM.

Well, as I said, get_free_pages doesn't even fail! It just seems to allow
others to use the RAM before I free it again... Or it corrupts some kernel
structs during munmap(), which certainly decrements the usage counter of the
upper pages to 0 again.

For now I'll try to reproduce the instability without using DMA hardware.

Thanks,

-- 
Bernd Harries


* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27 14:38 ` Eric W. Biederman
@ 2001-09-29  7:32   ` Bernd Harries
  0 siblings, 0 replies; 17+ messages in thread
From: Bernd Harries @ 2001-09-29  7:32 UTC (permalink / raw)
  Cc: linux-kernel

"Eric W. Biederman" wrote:

> Ouch.  This is where I give you the standard recommendation.  If you
> do this scatter-gather (so you don't need megs of contiguous memory)
> you should be much better off, and your driver should be more
> reliable.

Yep, the firmware on the pixel DSP behind the PLX-9054 bridge wants
the base address of a linear 2K * 1K * 2 picture buffer so that it can
trigger one of the 9054's DMA engines to transfer triangles line
by line into my memory buffer. If I mmap the PCI space to userland, each read
cycle costs 700-900 ns. The DMA engine can use bursts and then a cycle costs
only 29.9 ns of PCI bandwidth.
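
To put those cycle times into perspective, here is a small worked calculation. It is a sketch resting on one assumption that the message does not state: that each PCI data cycle moves 4 bytes (a 32-bit bus). frame_time_ps() is a hypothetical helper; times are kept in picoseconds so that 29.9 ns can be expressed as an integer.

```c
#include <assert.h>

/* Picture buffer from the message: 2K x 1K pixels, 2 bytes per pixel. */
#define FRAME_BYTES (2048ULL * 1024ULL * 2ULL)

/* Assumption: one PCI data cycle moves 4 bytes (32-bit bus). */
#define BYTES_PER_CYCLE 4ULL

/* Time in picoseconds to move one full frame at a given per-cycle cost. */
unsigned long long frame_time_ps(unsigned long long ps_per_cycle)
{
    return (FRAME_BYTES / BYTES_PER_CYCLE) * ps_per_cycle;
}
```

Under that assumption the 4 MB frame takes 1048576 cycles: roughly 0.73 s to 0.94 s when read cycle by cycle at 700-900 ns, against about 31 ms with burst DMA at 29.9 ns per cycle.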

>  All of the other techniques you have used like mmap should
> still apply.

> Also if you are exporting this data to user space, before your DMA
> completes you want to zero the pages you have allocated, so you don't
> have an information leak.

The DMA engine in the PLX 9054 can at least do Write-and-Invalidate cycles to
the Main RAM. :-)

Ciao,
-- 
Bernd Harries


* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27  8:56 Bernd Harries
  2001-09-27  9:15 ` Ingo Molnar
  2001-09-27  9:20 ` Ingo Molnar
@ 2001-09-27 14:38 ` Eric W. Biederman
  2001-09-29  7:32   ` Bernd Harries
  2 siblings, 1 reply; 17+ messages in thread
From: Eric W. Biederman @ 2001-09-27 14:38 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel

Bernd Harries <mlbha@gmx.de> writes:

> Hi all,

> In a driver I'm writing, in the open() method, I use multiple 
> __get_free_pages() to allocate a 4 MB kernel (image)buffer for DMA purposes.
> The buffer I get is contiguous (I try until it is) and is freed in
> close(). Order count is 9.

Ouch.  This is where I give you the standard recommendation.  If you
do this scatter-gather (so you don't need megs of contiguous memory)
you should be much better off, and your driver should be more
reliable.  All of the other techniques you have used like mmap should
still apply.

Also if you are exporting this data to user space, before your DMA
completes you want to zero the pages you have allocated, so you don't
have an information leak.

Eric



* Re: __get_free_pages(): is the MEM really mine?
@ 2001-09-27 14:19 Bernd Harries
  0 siblings, 0 replies; 17+ messages in thread
From: Bernd Harries @ 2001-09-27 14:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo

Ingo Molnar wrote:

> well - what did you expect to happen? A freed page is going to be reused
> for other purposes. A big 2MB allocation can be reused in part, once
> memory usage grows.

With my knowledge, I expected exactly that.

> So you should not expect the device to be able to DMA
> into a page that got freed, unpunished.

I am not. The DMA ioctl() finishes before the close(); the free happens only
after the hexdump and the DMA. The buffer is allocated in open(). The fact that I get
the same buffer again next time shows that the free is successful and
effective, right?

Sep 27 11:43:28 pcma73 kernel: rsc_open() minor=$1B 
Sep 27 11:43:28 pcma73 kernel:  DMA blk 0 at KV:$CE800000 BUS:$0E800000 
Sep 27 11:43:28 pcma73 kernel:  DMA blk 1 at KV:$CE600000 BUS:$0E600000 contig <

Sep 27 11:43:28 pcma73 kernel:  Collected DMA Buffer1 at KS:$0000CE600000

Sep 27 11:43:28 pcma73 kernel: rsc_ioctl()
Sep 27 11:43:28 pcma73 kernel:  RSC_IOC_DMA_OUT

Sep 27 11:43:28 pcma73 kernel: rsc_close()

Sep 27 11:43:28 pcma73 kernel:  Free DMA blk 0 at KS:$CE800000 
Sep 27 11:43:28 pcma73 kernel:  Free DMA blk 1 at KS:$CE600000 

> Perhaps i'm misunderstanding the problem.

My problem is that I'm out of ideas. All I can think of is to describe, in as
much detail as possible, the relevant things that I do and the things that
occur. Maybe someone more experienced will recognize a fundamental flaw in
the concept.

> Plus, if you allocate a 2MB
> physically contiguous chunk then the likelihood is high that there were
> fragmented pages skipped during the initial search for a 2MB block - so
> you still have a fair likelihood to reallocate it after some time, if
> memory usage is light. But this likelihood nears zero once RAM usage gets
> near 100%.

And I can rely on the fact that all of the 2 MB is contiguous memory without
holes, right? It's completely mine, isn't it?
Or is it perhaps illegal to allocate and free the memory over and over?
Would it be better to allocate the memory in init_module() instead of
rsc_open()? Presumably page tables are more likely to get corrupted this way
than if I allocated only once. Or do I have to use a spin_lock somewhere in
the nopage method?


From my tests I'm ready to believe the 1st page really _is_ mine, but now
I'm not so sure all the (1 << 9) pages really are.

If I don't access the pages, and just allocate them and free them after some
time, I have never seen any instabilities. But it seems that as soon as I
access pages above the 1st in the buffer, something gets corrupted. So maybe
today it's only legal to allocate 1 page at a time and I have to do that
(1 << 10) times...
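
For reference, the page counts quoted above follow from the allocation
order. The sketch below is plain user-space C, not the kernel's own macros:
MODEL_PAGE_SHIFT and the three helper names are hypothetical, and 4 KB
pages (x86) are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KB pages, as on x86 */

/* Number of pages in one block of the given allocation order. */
static size_t order_to_pages(unsigned order)
{
    assert(order < 20);       /* keep the shift well inside size_t */
    return (size_t)1 << order;
}

/* Size in bytes of one block of the given allocation order. */
static size_t order_to_bytes(unsigned order)
{
    return order_to_pages(order) << MODEL_PAGE_SHIFT;
}

/* Number of single pages needed to cover 'bytes', rounded up. */
static size_t bytes_to_pages(size_t bytes)
{
    size_t page_size = (size_t)1 << MODEL_PAGE_SHIFT;
    return (bytes + page_size - 1) >> MODEL_PAGE_SHIFT;
}
```

Under these assumptions an order-9 block is 512 pages (2 MB), and covering
the full 4 MB buffer one page at a time takes 1024 allocations.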

Or maybe some of the VM trouble I read about recently would also explain my
problems?

Thanks,

-- 
Bernd Harries

bha@gmx.de            http://bharries.freeyellow.com
bharries@web.de       Tel. +49 421 809 7343 priv.  | MSB First!
harries@stn-atlas.de       +49 421 457 3966 offi.  | Linux-m68k
bernd@linux-m68k.org       +49 172 139 6054 handy  | Medusa T40




* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27  8:56 Bernd Harries
  2001-09-27  9:15 ` Ingo Molnar
@ 2001-09-27  9:20 ` Ingo Molnar
  2001-09-27 14:38 ` Eric W. Biederman
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2001-09-27  9:20 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel


another method is to apply the attached patch to 2.4.10 and check from the
stack traces whether everything happens in the order and places you
intended. Your driver should be the only thing doing order-9 allocations on
your system.

	Ingo

--- linux/mm/page_alloc.c.orig	Thu Sep 27 11:04:02 2001
+++ linux/mm/page_alloc.c	Thu Sep 27 11:05:27 2001
@@ -70,6 +70,10 @@
 	struct page *base;
 	zone_t *zone;

+	if (order == 9) {
+		printk("free_pages order 9 called.\n");
+		show_stack(NULL);
+	}
 	if (page->buffers)
 		BUG();
 	if (page->mapping)
@@ -319,6 +323,10 @@
 	struct page * page;
 	int freed;

+	if (order == 9) {
+		printk("alloc_pages order 9 called.\n");
+		show_stack(NULL);
+	}
 	zone = zonelist->zones;
 	classzone = *zone;
 	for (;;) {



* Re: __get_free_pages(): is the MEM really mine?
  2001-09-27  8:56 Bernd Harries
@ 2001-09-27  9:15 ` Ingo Molnar
  2001-09-27  9:20 ` Ingo Molnar
  2001-09-27 14:38 ` Eric W. Biederman
  2 siblings, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2001-09-27  9:15 UTC (permalink / raw)
  To: Bernd Harries; +Cc: linux-kernel


On Thu, 27 Sep 2001, Bernd Harries wrote:

> Is __get_free_pages() not enough to allocate memory in the kernel?
> Seems like something else is using the same memory. Do I have to lock
> the pages I allocated?

it's enough. The pages you allocate through the Linux page allocator are
private, no additional locking is necessary.

> In a driver I'm writing, in the open() method, I use multiple
> __get_free_pages() to allocate a 4 MB kernel (image)buffer for DMA
> purposes. The buffer I get is contiguous (I try until it is) and is
> freed in close(). Order count is 9.

so you are using two __get_free_pages(order==9) calls to get two chunks of
2 MB physical areas, and you use them as a single 4 MB area? This is
perfectly legal - but there is no guarantee you will succeed getting two
adjacent 2 MB blocks. You will get them if your driver initializes during
bootup - but if it's loaded/unloaded via kmod while the system is up and
running and using its RAM, chances are very low that you'll get a single 2
MB block - let alone two that are adjacent.
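
Whether two separately allocated chunks really form one contiguous area
can be checked from their bus addresses. The following is a user-space
sketch with a hypothetical helper, chunks_contiguous(), not a kernel API.
The sample values in the usage note are the two bus addresses from the
driver log posted elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if the two chunks of 'chunk_size' bytes sit directly next to
 * each other, in either order. On success the lower start address is stored
 * in *base (if base is not NULL), so the pair can be used as one buffer.
 */
static bool chunks_contiguous(uint64_t a, uint64_t b,
                              uint64_t chunk_size, uint64_t *base)
{
    assert(chunk_size > 0);

    uint64_t lo = a < b ? a : b;
    uint64_t hi = a < b ? b : a;

    if (hi - lo != chunk_size)
        return false;          /* gap or overlap between the two chunks */
    if (base != NULL)
        *base = lo;
    return true;
}
```

With the logged addresses $0E800000 and $0E600000 and a 2 MB chunk size,
the check succeeds and the combined buffer starts at $0E600000, the lower
of the two, regardless of the order in which the chunks were handed out.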

> When I run the user appl. again after short time I mostly get the same
> chunk of physical memory (virt_to_bus is identical!)

have you perhaps freed that page? printk every occasion of
allocating/freeing a 2 MB buffer and i'm sure you'll see the problem.
(Perhaps it's the close() implicitly done by exit() that frees the
buffer?)

> If I repeat the user program within seconds, suddenly the 2nd 256 byte
> dump starts to change. Sometimes I see filenames of my harddisk within
> the hexdump, looking like some directory listing. (e.g.
> "/etc/ppp/options" ) Sometimes I see the contents of the printk buffer
> of the kernel, sometimes stuff I cannot identify.

it appears you freed the page. send the relevant parts of the driver code
for details.

	Ingo



* __get_free_pages(): is the MEM really mine?
@ 2001-09-27  8:56 Bernd Harries
  2001-09-27  9:15 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Bernd Harries @ 2001-09-27  8:56 UTC (permalink / raw)
  To: linux-kernel

Hi all,

this is my 4th try to post to the list. I didn't see any echo, so 
I try again. Sorry if you did see the msg earlier (yesterday)..

Is __get_free_pages() not enough to allocate memory in the kernel?
Seems like something else is using the same memory. Do I have to lock the
pages I allocated? 

I began with 2.4.6 on a dual CPU x86 box with 256 MB RAM and when I saw
problems I upgraded to 2.4.10. Still unstable.

In a driver I'm writing, in the open() method, I use multiple 
__get_free_pages() to allocate a 4 MB kernel (image)buffer for DMA purposes.
The buffer I get is contiguous (I try until it is) and is freed in
close(). Order count is 9.

When I run the user appl. again after short time I mostly get the 
same chunk of physical memory (virt_to_bus is identical!)

For access from userspace I implemented mmap() which uses the nopage()
method of the VMA. The user program hexdumps 256 bytes at the beginning
of the 4 MB buffer and 256 bytes at offset 0x2000 from the beginning.
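
The address arithmetic a nopage-style handler has to get right can be
sketched in user space as follows. This is an illustration with a
hypothetical helper name, not the driver's actual code; the sample numbers
in the usage note are taken from the test program output and the printk
text visible in the second hexdump further down in this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: translate a faulting user address inside the mapping
 * into the kernel virtual address of the byte that backs it.
 * All values are 32-bit, as on the x86 box described in this thread.
 */
static uint32_t fault_to_kernel_addr(uint32_t buf_start, uint32_t vm_start,
                                     uint32_t vma_offset, uint32_t address)
{
    assert(address >= vm_start);   /* the fault must lie inside the VMA */
    return buf_start + (address - vm_start) + vma_offset;
}
```

With a buffer at $C3800000, a mapping starting at $40132000, a VMA offset
of 0 and a fault at $40134000, the helper yields $C3802000, which matches
the "start+off" value logged for that fault.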

After the hexdump from userspace I trigger a DMA engine to copy
0x8000 bytes (4 * the offset of the 2nd hexdump) from my kernel buffer to a
'Local RAM' on a PCI card. (For now I only copy out, to be sure the
buffer is not modified.)
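
To make explicit which pages of the buffer these accesses touch, here is a
small user-space calculation. It assumes 4 KB pages (x86), and the helper
names page_index() and pages_spanned() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KB pages, as on x86 */

/* Index of the page that contains the given byte offset into the buffer. */
static size_t page_index(size_t offset)
{
    return offset >> MODEL_PAGE_SHIFT;
}

/* Number of pages touched by an access of 'len' bytes starting at 'offset'. */
static size_t pages_spanned(size_t offset, size_t len)
{
    assert(len > 0);
    return page_index(offset + len - 1) - page_index(offset) + 1;
}
```

Under that assumption the second hexdump at offset 0x2000 reads from page 2
of the buffer, and the 0x8000-byte DMA covers pages 0 through 7, so both
accesses reach beyond the first page.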

I see mostly zeroes in both of the 2 hexdumps.

If I repeat the user program within seconds, suddenly the 2nd 
256 byte dump starts to change. Sometimes I see filenames of my harddisk
within the hexdump, looking like some directory listing. (e.g.
"/etc/ppp/options" ) Sometimes I see the contents of the printk buffer of
the kernel, sometimes stuff I cannot identify.

The dump from the first page seems to stay zero all the time.
The bus address of the buffer is the same each time.

I wouldn't care about the changes if the system didn't seem
to become compromised by the tests. Sometimes a dump occurs on the console
when I try to build a new version of my driver module.
Sometimes the shell in which I started the test program gets logged out.

I have a feeling that the effect only occurs if the 2nd dump is beyond the
1st page of my kernel buffer.



Here is the output of my test program:

pcma73:/home/bharries/c/apr/>aprdma_shmw 0x8000 0 1
 open('/dev/aprsc027', ) seems ok! fd = 3 
 Get fix par 
 mmio: start=$DC800000 off=$00000000 len=$00001000 
 mem1: start=$E0000000 off=$00000000 len=$02000000 
 mem2: start=$DA000000 off=$00000000 len=$02000000 

 colcon_offs=$00000000 
 fifo1_offs =$01000000 
 fifo2_offs =$01100000 
 shm_offs   =$01400000 shm_ram_size=$00400000 
 hwcsr_offs =$01A00000 

 Get var par 
 rx_pmd_adr  =$00000000 rx_msg_typ =$00000000 
 tx_pmd_adr  =$00000000 tx_msg_typ =$00000000 
 dma_bus_adr0=$00000000 contig_len0=$00000000 
 dma_bus_adr1=$03800000 contig_len1=$00400000    <-- BUS Addr

 dma0=$00000000 len=$00000000 
 dma1=$40132000 len=$00400000           <-- mmapped User Addr

Diagnose Dump Adr=$40132000

:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00                  
  
Diagnose Dump Adr=$40134000

:3D 24 30 30 30 30 30 30 30 30 20 0A 53 65 70 20  =$00000000 *Sep 
:32 36 20 31 32 3A 31 35 3A 30 34 20 70 63 6D 61  26 12:15:04 pcma
:37 33 20 6B 65 72 6E 65 6C 3A 20 20 73 74 61 72  73 kernel:  star
:74 2B 6F 66 66 3D 24 43 33 38 30 30 30 30 30 20  t+off=$C3800000 
:70 61 67 65 5F 70 74 72 3D 24 63 31 30 65 30 30  page_ptr=$c10e00
:30 30 20 0A 53 65 70 20 32 36 20 31 32 3A 31 35  00 *Sep 26 12:15
:3A 30 34 20 70 63 6D 61 37 33 20 6B 65 72 6E 65  :04 pcma73 kerne
:6C 3A 20 20 61 64 64 72 65 73 73 3D 24 34 30 31  l:  address=$401
:33 34 30 30 30 20 61 64 20 2D 20 76 6D 5F 73 74  34000 ad - vm_st
:61 72 74 3D 24 30 30 30 30 32 30 30 30 20 56 4D  art=$00002000 VM
:41 5F 4F 46 46 53 45 54 3D 24 30 30 30 30 30 30  A_OFFSET=$000000
:30 30 20 0A 53 65 70 20 32 36 20 31 32 3A 31 35  00 *Sep 26 12:15
:3A 30 34 20 70 63 6D 61 37 33 20 6B 65 72 6E 65  :04 pcma73 kerne
:6C 3A 20 20 73 74 61 72 74 2B 6F 66 66 3D 24 43  l:  start+off=$C
:33 38 30 32 30 30 30 20 70 61 67 65 5F 70 74 72  3802000 page_ptr
:3D 24 63 31 30 65 30 30 38 30 20 0A 00 00 00 00  =$c10e0080 *    
   Fill DMA ioctl struct 
 Local RAM write triggered. 
 Local RAM write end. 

 Now close '/dev/aprsc027' fd = 3 ...




-- 
Bernd Harries

bha@gmx.de            http://bharries.freeyellow.com
bharries@web.de       Tel. +49 421 809 7343 priv.  | MSB First!
harries@stn-atlas.de       +49 421 457 3966 offi.  | Linux-m68k
bernd@linux-m68k.org       +49 172 139 6054 handy  | Medusa T40




Thread overview: 17+ messages
2001-09-27 10:06 __get_free_pages(): is the MEM really mine? Bernd Harries
2001-09-27 13:00 ` Ingo Molnar
2001-09-29 17:15   ` Bernd Harries
2001-09-30  7:27     ` Ingo Molnar
2001-09-30 12:59       ` Bernd Harries
2001-10-01  5:55         ` Ingo Molnar
2001-10-05  8:49           ` Bernd Harries
  -- strict thread matches above, loose matches on Subject: below --
2001-10-01 11:33 Bernd Harries
2001-10-05 12:55 ` Hugh Dickins
2001-10-05 13:32   ` Bernd Harries
2001-10-05 15:27     ` Hugh Dickins
2001-09-27 14:19 Bernd Harries
2001-09-27  8:56 Bernd Harries
2001-09-27  9:15 ` Ingo Molnar
2001-09-27  9:20 ` Ingo Molnar
2001-09-27 14:38 ` Eric W. Biederman
2001-09-29  7:32   ` Bernd Harries
