LKML Archive on lore.kernel.org
* [PATCH 0/3] mm: optimize thp for reclaim and migration
From: Yu Zhao
Date: 2021-07-31 6:39 UTC
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox,
    Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Yu Zhao

Systems using /sys/kernel/mm/transparent_hugepage/enabled=always can
experience memory pressure due to internal fragmentation from a large
number of thp. Userspace can leave many subpages untouched, but reclaim
can't identify them based on the dirty bit (and drop them the way it
drops clean pages). However, it's still possible to know whether a
subpage is equivalent to being clean by checking its content. When
splitting a thp for reclaim or migration, we can drop subpages that
contain only zeros and therefore avoid writing them back or duplicating
them.

benchmark
=========
In the best case scenario, only the first byte of the head subpage is
non-zero. In the worst case scenario, only the last byte of every
subpage is non-zero.

zram 10GB      before     after
~~~~~~~~~
best case      7.170s    3.559s   -50.36%
worst case     8.216s    8.94s     +8.81%

swap 10GB      before     after
~~~~~~~~~
best case     70.466s    3.544s   -94.97%
worst case    67.014s   65.521s    -2.22% (noise)

(CONFIG_THP_SWAP=n)

zram (before)
=============
best case
~~~~~~~~~
time 7.170s
  21.78% clear_page_erms
  16.47% zram_bvec_rw
   2.90% clear_huge_page
   2.74% _raw_spin_lock
   2.65% page_vma_mapped_walk
   2.41% flush_tlb_func
   2.21% shrink_page_list
   1.55% try_to_unmap_one
   1.29% _raw_spin_lock_irqsave
   1.25% page_counter_cancel
   1.21% __mod_node_page_state
   1.17% __mod_lruvec_page_state
   1.10% add_to_swap_cache
   1.08% xas_create
   1.07% xas_store

worst case
~~~~~~~~~~
time 8.216s
  17.27% clear_page_erms
  13.25% lzo1x_1_do_compress
   4.35% zram_bvec_rw
   3.41% memset_erms
   3.23% _raw_spin_lock
   2.88% page_vma_mapped_walk
   2.75% clear_huge_page
   2.28% flush_tlb_func
   1.92% shrink_page_list
   1.39% try_to_unmap_one
   1.21% page_counter_cancel
   1.20% zs_free
   1.14% _raw_spin_lock_irqsave
   1.13% xas_create
   1.02% __mod_lruvec_page_state

zram (after)
============
best case
~~~~~~~~~
time 3.559s
  44.55% clear_page_erms
  27.74% memchr_inv
   6.43% clear_huge_page
   2.71% split_huge_page_to_list
   1.79% page_vma_mapped_walk
   1.31% __split_huge_pmd
   1.21% remove_migration_pte
   1.12% __free_one_page

worst case
~~~~~~~~~~
time 8.94s
  16.08% clear_page_erms
  11.81% memchr_inv
   9.62% lzo1x_1_do_compress
   3.51% memset_erms
   3.17% zram_bvec_rw
   2.76% _raw_spin_lock
   2.70% clear_huge_page
   2.34% flush_tlb_func
   2.25% page_vma_mapped_walk
   1.56% shrink_page_list
   1.33% try_to_unmap_one
   1.07% _raw_spin_lock_irqsave
   1.04% xas_create
   1.01% zs_free

disk (before)
=============
best case
~~~~~~~~~
time 70.466s
  23.38% clear_page_erms
   3.91% clear_huge_page
   3.32% _raw_spin_lock
   2.95% page_vma_mapped_walk
   2.43% shrink_page_list
   2.25% flush_tlb_func
   1.84% try_to_unmap_one
   1.53% _raw_spin_lock_irqsave
   1.45% __mod_memcg_lruvec_state
   1.37% page_counter_cancel
   1.17% _raw_spin_lock_irq
   1.16% xas_create
   1.12% _find_next_bit
   1.11% xas_store
   1.11% add_to_swap_cache
   1.10% __mod_node_page_state
   1.04% __mod_lruvec_page_state
   1.04% __free_one_page

worst case
~~~~~~~~~~
time 67.014s
  25.54% clear_page_erms
   4.36% clear_huge_page
   3.51% _raw_spin_lock
   2.97% page_vma_mapped_walk
   2.85% flush_tlb_func
   1.84% try_to_unmap_one
   1.52% shrink_page_list
   1.47% __mod_memcg_lruvec_state
   1.42% page_counter_cancel
   1.38% _raw_spin_lock_irq
   1.28% __mod_lruvec_page_state
   1.15% _raw_spin_lock_irqsave
   1.13% xas_load
   1.10% add_to_swap_cache
   1.05% __mod_node_page_state

disk (after)
============
best case
~~~~~~~~~
time 3.544s
  42.58% clear_page_erms
  27.44% memchr_inv
   6.15% clear_huge_page
   2.60% split_huge_page_to_list
   1.74% page_vma_mapped_walk
   1.26% __free_one_page
   1.11% remove_migration_pte
   1.10% __split_huge_pmd
   1.08% __list_del_entry_valid

worst case
~~~~~~~~~~
time 65.521s
  21.83% clear_page_erms
  14.39% memchr_inv
   3.66% clear_huge_page
   3.17% _raw_spin_lock
   2.38% page_vma_mapped_walk
   2.32% flush_tlb_func
   1.59% try_to_unmap_one
   1.39% shrink_page_list
   1.26% page_counter_cancel
   1.16% __mod_memcg_lruvec_state
   1.11% add_to_swap_cache
   1.05% _raw_spin_lock_irq

test.c
======
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(int argc, char **argv)
{
	char *addr;
	size_t size;
	int offset, num_dirty;
	int ret;

	if (argc != 4) {
		printf("Usage: ./a.out SIZE OFFSET NUM_DIRTY\n"
		       "SIZE: Size of mmap region in MB\n"
		       "OFFSET: Offset of (first) dirty byte in each 4KB page, 0 ~ 4095\n"
		       "NUM_DIRTY: Number of dirty 4KB pages in each 2MB region, 0 ~ 512\n");
		return -1;
	}

	size = (size_t)atoi(argv[1]) << 20;
	offset = atoi(argv[2]);
	num_dirty = atoi(argv[3]);

	addr = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		handle_error("mmap");

	ret = madvise(addr, size, MADV_HUGEPAGE);
	if (ret == -1)
		handle_error("madvise hugepage");

	/* dirty one byte at OFFSET in the first NUM_DIRTY subpages of each 2MB region */
	for (size_t i = 0; i < size; i += (1 << 21))
		for (size_t j = 0; j < ((size_t)num_dirty << 12); j += (1 << 12))
			memset(addr + i + j + offset, 0xff, 1);

	ret = madvise(addr, size, MADV_PAGEOUT);
	if (ret == -1)
		handle_error("madvise pageout");

	ret = munmap(addr, size);
	if (ret == -1)
		handle_error("munmap");

	return 0;
}

Yu Zhao (3):
  mm: don't take lru lock when splitting isolated thp
  mm: free zapped tail pages when splitting isolated thp
  mm: don't remap clean subpages when splitting isolated thp

 include/linux/rmap.h          |  2 +-
 include/linux/vm_event_item.h |  2 ++
 mm/huge_memory.c              | 63 ++++++++++++++++++++++++---------
 mm/migrate.c                  | 65 ++++++++++++++++++++++++++++++-----
 mm/vmstat.c                   |  2 ++
 5 files changed, 107 insertions(+), 27 deletions(-)

-- 
2.32.0.554.ge1b32706d8-goog
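Two notes on the numbers above. The posting doesn't show the exact
benchmark command lines; given test.c's SIZE/OFFSET/NUM_DIRTY
semantics, the 10GB best/worst cases plausibly correspond to:

	./a.out 10240 0 1	# best case: first byte of the head subpage dirty
	./a.out 10240 4095 512	# worst case: last byte of every subpage dirty

The memchr_inv() entries in the "after" profiles come from the content
check described in the cover letter. A minimal sketch of how an
all-zero subpage can be detected with it (illustrative only, not
necessarily the exact code in patch 3; memchr_inv() returns NULL when
every byte in the range matches):

	static bool subpage_is_zero(struct page *page)
	{
		void *addr = kmap_atomic(page);
		/* NULL iff all PAGE_SIZE bytes are 0 */
		bool zero = !memchr_inv(addr, 0, PAGE_SIZE);

		kunmap_atomic(addr);
		return zero;
	}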
* [PATCH 1/3] mm: don't take lru lock when splitting isolated thp
From: Yu Zhao
Date: 2021-07-31 6:39 UTC
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox,
    Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Yu Zhao, Shuang Zhai

When splitting an isolated thp under reclaim or migration, we won't
put its tail pages on the lru list, and therefore we don't need to
take the lru lock.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
---
 mm/huge_memory.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index afff3ac87067..d8b655856e79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2342,17 +2342,14 @@ static void remap_page(struct page *page, unsigned int nr)
 	}
 }
 
-static void lru_add_page_tail(struct page *head, struct page *tail,
-		struct lruvec *lruvec, struct list_head *list)
+static void lru_add_page_tail(struct page *head, struct page *tail, struct list_head *list)
 {
 	VM_BUG_ON_PAGE(!PageHead(head), head);
 	VM_BUG_ON_PAGE(PageCompound(tail), head);
 	VM_BUG_ON_PAGE(PageLRU(tail), head);
-	lockdep_assert_held(&lruvec->lru_lock);
 
 	if (list) {
-		/* page reclaim is reclaiming a huge page */
-		VM_WARN_ON(PageLRU(head));
+		/* page reclaim or migration is splitting an isolated thp */
 		get_page(tail);
 		list_add_tail(&tail->lru, list);
 	} else {
@@ -2363,8 +2360,7 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
 	}
 }
 
-static void __split_huge_page_tail(struct page *head, int tail,
-		struct lruvec *lruvec, struct list_head *list)
+static void __split_huge_page_tail(struct page *head, int tail, struct list_head *list)
 {
 	struct page *page_tail = head + tail;
 
@@ -2425,19 +2421,21 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * pages to show after the currently processed elements - e.g.
 	 * migrate_pages
 	 */
-	lru_add_page_tail(head, page_tail, lruvec, list);
+	lru_add_page_tail(head, page_tail, list);
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
 		pgoff_t end)
 {
 	struct page *head = compound_head(page);
-	struct lruvec *lruvec;
+	struct lruvec *lruvec = NULL;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
 	int i;
 
+	VM_BUG_ON_PAGE(list && PageLRU(head), head);
+
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, nr);
 
@@ -2450,10 +2448,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = lock_page_lruvec(head);
+	if (!list)
+		lruvec = lock_page_lruvec(head);
 
 	for (i = nr - 1; i >= 1; i--) {
-		__split_huge_page_tail(head, i, lruvec, list);
+		__split_huge_page_tail(head, i, list);
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
@@ -2471,7 +2470,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	ClearPageCompound(head);
-	unlock_page_lruvec(lruvec);
+	if (lruvec)
+		unlock_page_lruvec(lruvec);
 	/* Caller disabled irqs, so they are still disabled here */
 
 	split_page_owner(head, nr);
@@ -2645,6 +2645,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
 	VM_BUG_ON_PAGE(!PageLocked(head), head);
 	VM_BUG_ON_PAGE(!PageCompound(head), head);
+	VM_BUG_ON_PAGE(list && PageLRU(head), head);
 
 	if (PageWriteback(head))
 		return -EBUSY;
-- 
2.32.0.554.ge1b32706d8-goog
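For context, the reclaim-side call that passes a non-NULL list (a
hedged sketch condensed from shrink_page_list() in mm/vmscan.c of this
era): pages reaching this point were taken off the lru by
isolate_lru_pages(), which is why PageLRU is already clear and the lru
lock can be skipped.

	/* the thp is isolated and locked; on success its tail pages
	 * are queued on page_list for the rest of the reclaim pass */
	if (split_huge_page_to_list(page, page_list))
		goto activate_locked;	/* split failed: keep the thp */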
* [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yu Zhao
Date: 2021-07-31 6:39 UTC
To: linux-mm
Cc: Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox,
    Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Yu Zhao, Shuang Zhai

If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
---
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 28 ++++++++++++++++++++++++++++
 mm/vmstat.c                   |  1 +
 3 files changed, 30 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ae0dd1948c2b..829eeac84094 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -99,6 +99,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	THP_SPLIT_PUD,
 #endif
+	THP_SPLIT_FREE,
 	THP_ZERO_PAGE_ALLOC,
 	THP_ZERO_PAGE_ALLOC_FAILED,
 	THP_SWPOUT,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8b655856e79..5120478bca41 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2432,6 +2432,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
+	LIST_HEAD(pages_to_free);
+	int nr_pages_to_free = 0;
 	int i;
 
 	VM_BUG_ON_PAGE(list && PageLRU(head), head);
@@ -2506,6 +2508,25 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			continue;
 		unlock_page(subpage);
 
+		/*
+		 * If a tail page has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * tail page was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && page_ref_freeze(subpage, 2)) {
+			VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
+			VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
+			VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
+
+			ClearPageActive(subpage);
+			ClearPageUnevictable(subpage);
+			list_move(&subpage->lru, &pages_to_free);
+			nr_pages_to_free++;
+			continue;
+		}
+
 		/*
 		 * Subpages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -2515,6 +2536,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		put_page(subpage);
 	}
+
+	if (!nr_pages_to_free)
+		return;
+
+	mem_cgroup_uncharge_list(&pages_to_free);
+	free_unref_page_list(&pages_to_free);
+	count_vm_events(THP_SPLIT_FREE, nr_pages_to_free);
 }
 
 int total_mapcount(struct page *page)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b0534e068166..f486e5d98d96 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1300,6 +1300,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	"thp_split_pud",
 #endif
+	"thp_split_free",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
-- 
2.32.0.554.ge1b32706d8-goog
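To illustrate how a tail page ends up "concurrently zapped" (a
user-space sketch, not part of the series; assumes the kernel backs the
region with a single aligned thp): madvise(MADV_DONTNEED) on a 4KB
sub-range of a mapped thp unmaps just that subpage while the compound
page stays intact, which is the situation this patch detects at split
time.

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t thp = 2UL << 20;
		char *p = mmap(NULL, thp, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		madvise(p, thp, MADV_HUGEPAGE);
		memset(p, 0xff, thp);			/* fault in one 2MB thp */
		madvise(p + 4096, 4096, MADV_DONTNEED);	/* zap the second subpage */
		/*
		 * If reclaim later isolates and splits this thp, the zapped
		 * subpage holds only the two references described in the
		 * commit message and can be freed on the spot instead of
		 * being swapped out.
		 */
		return 0;
	}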
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Kirill A. Shutemov
Date: 2021-08-04 14:22 UTC
To: Yu Zhao
Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai

On Sat, Jul 31, 2021 at 12:39:37AM -0600, Yu Zhao wrote:
> @@ -2506,6 +2508,25 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  			continue;
>  		unlock_page(subpage);
>  
> +		/*
> +		 * If a tail page has only two references left, one inherited
> +		 * from the isolation of its head and the other from
> +		 * lru_add_page_tail() which we are about to drop, it means this
> +		 * tail page was concurrently zapped. Then we can safely free it
> +		 * and save page reclaim or migration the trouble of trying it.
> +		 */
> +		if (list && page_ref_freeze(subpage, 2)) {
> +			VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
> +			VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
> +			VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
> +
> +			ClearPageActive(subpage);
> +			ClearPageUnevictable(subpage);

Why touch PG_Active/PG_Unevictable?

> +			list_move(&subpage->lru, &pages_to_free);
> +			nr_pages_to_free++;
> +			continue;
> +		}
> +

-- 
 Kirill A. Shutemov
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yu Zhao
Date: 2021-08-08 17:28 UTC
To: Kirill A. Shutemov
Cc: Linux-MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai

On Wed, Aug 4, 2021 at 8:22 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Sat, Jul 31, 2021 at 12:39:37AM -0600, Yu Zhao wrote:
> > +			ClearPageActive(subpage);
> > +			ClearPageUnevictable(subpage);
>
> Why touch PG_Active/PG_Unevictable?

Subpages may inherit these flags from their isolated head. Page
reclaim doesn't isolate active or unevictable pages, but migration
does. If we don't clear the flags here, we'll hit bad_page() later
because both flags are included in PAGE_FLAGS_CHECK_AT_FREE.

Does that make sense? Thanks.
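For reference, the check being referred to (a condensed, hedged sketch
of the free-path sanity check in mm/page_alloc.c; the exact helper
names vary across kernel versions):

	/*
	 * Freeing a page with PG_active or PG_unevictable still set
	 * trips this, since both flags are part of
	 * PAGE_FLAGS_CHECK_AT_FREE.
	 */
	if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_FREE))
		bad_page(page, "PAGE_FLAGS_CHECK_AT_FREE flag(s) set");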
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yang Shi
Date: 2021-08-05 0:13 UTC
To: Yu Zhao
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Fri, Jul 30, 2021 at 11:39 PM Yu Zhao <yuzhao@google.com> wrote:
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> [...]
>
> +		if (list && page_ref_freeze(subpage, 2)) {
> +			VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
> +			VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
> +			VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
> +
> +			ClearPageActive(subpage);
> +			ClearPageUnevictable(subpage);
> +			list_move(&subpage->lru, &pages_to_free);
> +			nr_pages_to_free++;
> +			continue;
> +		}

Yes, such a page could be freed instead of swapped out. But I'm
wondering if we could have some simpler implementation. Since such
pages will be re-added to the page list, we should be able to check
their refcount in shrink_page_list(). If the refcount is 1, the
refcount inc'ed by lru_add_page_tail() has been put by the later
put_page(), so we know the page was freed under us: the only remaining
refcount comes from isolation. We could just jump to "keep" (the label
in shrink_page_list()), and such pages would be freed later by
shrink_inactive_list().

For MADV_PAGEOUT, I think we could add some logic to handle such pages
after shrink_page_list(), just like what shrink_inactive_list() does.

Migration already handles refcount == 1 pages, so it should not need
any change.

Is this idea feasible?
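A hedged sketch of the alternative being proposed here (illustrative
only, against the shrink_page_list() loop of this era; the locking
details around the "keep" label are elided):

	/*
	 * A formerly-tail page whose only remaining reference is the
	 * one from isolation was zapped while the thp was off the lru;
	 * skip the reclaim work and let the final putback in
	 * shrink_inactive_list() free it.
	 */
	if (page_count(page) == 1)
		goto keep;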
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yu Zhao
Date: 2021-08-08 17:49 UTC
To: Yang Shi
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Wed, Aug 4, 2021 at 6:13 PM Yang Shi <shy828301@gmail.com> wrote:
>
> Yes, such a page could be freed instead of swapped out. But I'm
> wondering if we could have some simpler implementation. Since such
> pages will be re-added to the page list, we should be able to check
> their refcount in shrink_page_list(). If the refcount is 1, the
> refcount inc'ed by lru_add_page_tail() has been put by the later
> put_page(), so we know the page was freed under us: the only remaining
> refcount comes from isolation. We could just jump to "keep" (the label
> in shrink_page_list()), and such pages would be freed later by
> shrink_inactive_list().
>
> For MADV_PAGEOUT, I think we could add some logic to handle such pages
> after shrink_page_list(), just like what shrink_inactive_list() does.
>
> Migration already handles refcount == 1 pages, so it should not need
> any change.
>
> Is this idea feasible?

Yes, but then we would have to loop over the tail pages twice, here
and in shrink_page_list(), right?

In addition, if we tried to freeze the refcount of a page in
shrink_page_list(), we couldn't be certain whether that page used to
be a tail page, so we would have to test every page. If a page wasn't
a tail page, it's unlikely for its refcount to drop unless there is a
race. But this patch isn't really intended to optimize such a race;
it's mainly for the next patch, i.e., we know there is a good chance
to drop tail pages (~10% on our systems). Sounds reasonable? Thanks.
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yang Shi
Date: 2021-08-11 22:25 UTC
To: Yu Zhao
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Sun, Aug 8, 2021 at 10:49 AM Yu Zhao <yuzhao@google.com> wrote:
>
> Yes, but then we would have to loop over the tail pages twice, here
> and in shrink_page_list(), right?

I don't quite get what you mean by "loop over the tail pages twice".
Once the THP is isolated and then split, all the tail pages will be
put on the list (the local list for isolated pages); the reclaimer
deals with the head page, then continues to iterate the list to deal
with the tail pages. Your patch could free the tail pages earlier, but
it should not make too much difference to free the tail pages a little
bit later IMHO.

> In addition, if we tried to freeze the refcount of a page in
> shrink_page_list(), we couldn't be certain whether that page used to
> be a tail page, so we would have to test every page. If a page wasn't
> a tail page, it's unlikely for its refcount to drop unless there is a
> race. But this patch isn't really intended to optimize such a race;
> it's mainly for the next patch, i.e., we know there is a good chance
> to drop tail pages (~10% on our systems). Sounds reasonable? Thanks.

I'm not sure what the main source of the partially mapped THPs in your
fleets is. But if most of them are generated by MADV_DONTNEED (which is
used by some userspace memory allocator libs), they should be on the
deferred split list too. Currently the deferred split shrinker just
shrinks those THPs (simply splits them and frees the unmapped subpages)
proportionally; we definitely could shrink them more aggressively, for
example, by setting shrinker->seeks to 0. I'm wondering if this would
achieve a similar effect or not.

I really don't have any objection to freeing such pages, but I'm just
wondering if we could have something simpler or not.
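For reference, the knob being floated would be a one-line change to the
deferred split shrinker's definition in mm/huge_memory.c (a hedged
sketch; the actual definition of this era uses DEFAULT_SEEKS):

	static struct shrinker deferred_split_shrinker = {
		.count_objects	= deferred_split_count,
		.scan_objects	= deferred_split_scan,
		.seeks		= 0,	/* was DEFAULT_SEEKS; 0 scans aggressively */
		.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
				  SHRINKER_NONSLAB,
	};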
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yu Zhao
Date: 2021-08-11 23:12 UTC
To: Yang Shi
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Wed, Aug 11, 2021 at 4:25 PM Yang Shi <shy828301@gmail.com> wrote:
>
> I don't quite get what you mean by "loop over the tail pages twice".
> Once the THP is isolated and then split, all the tail pages will be
> put on the list (the local list for isolated pages); the reclaimer
> deals with the head page, then continues to iterate the list to deal
> with the tail pages. Your patch could free the tail pages earlier, but
> it should not make too much difference to free the tail pages a little
> bit later IMHO.

We are in a (the first) loop here. If we freed the tail pages later,
we would need to loop over them again (the second).

IOW,
1) __split_huge_page(): for each of the 511 tail pages (first loop).
2) shrink_page_list(): for each of the 511 tail pages (second loop).

> I'm not sure what the main source of the partially mapped THPs in your
> fleets is. But if most of them are generated by MADV_DONTNEED (which is
> used by some userspace memory allocator libs), they should be on the
> deferred split list too. [...] I'm wondering if this would achieve a
> similar effect or not.

Not partially mapped, but internal fragmentation.

IOW, some of the 4KB pages within a THP were never written into, which
can be common depending on the implementation of the userspace memory
allocator.

> I really don't have any objection to freeing such pages, but I'm just
> wondering if we could have something simpler or not.

Thanks.
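A user-space illustration of the internal fragmentation being described
(a hedged sketch, not from the thread; assumes the region is backed by
a single thp):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		madvise(p, 2UL << 20, MADV_HUGEPAGE);
		memset(p, 0xff, 64 << 10);	/* the app only ever uses 64KB */
		/*
		 * The first write faults in a whole 2MB thp; the other 496
		 * subpages are never touched and contain only zeros -- the
		 * subpages patch 3 avoids remapping (and swapping) when the
		 * thp is split.
		 */
		return 0;
	}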
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yang Shi
Date: 2021-08-13 23:24 UTC
To: Yu Zhao
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Wed, Aug 11, 2021 at 4:12 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Not partially mapped, but internal fragmentation.
>
> IOW, some of the 4KB pages within a THP were never written into, which
> can be common depending on the implementation of the userspace memory
> allocator.

OK, this is actually what patch #3 does. Patch #3 just doesn't remap
the "all zero" pages when splitting the THP, IIUC. But such a page
still holds the refcount from isolation, so it can't simply be freed
by put_page().

Actually this makes me think my suggestion is better. It doesn't make
too much sense to me to have page-free logic (manipulating flags,
uncharging memcg, etc.) in the THP split.

There have been a couple of places that handle such cases:
- deferred split shrinker: the unmapped subpage is just freed by
  put_page() since there is no extra refcount
- migration: check the page refcount, then free the refcount == 1 ones

Here you add a third case in the page reclaim path, so why not just
let page reclaim handle all the work of freeing pages?
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp
From: Yu Zhao
Date: 2021-08-13 23:56 UTC
To: Yang Shi
Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A. Shutemov, Matthew
    Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai

On Fri, Aug 13, 2021 at 5:24 PM Yang Shi <shy828301@gmail.com> wrote:
>
> There have been a couple of places that handle such cases:
> - deferred split shrinker: the unmapped subpage is just freed by
>   put_page() since there is no extra refcount
> - migration: check the page refcount, then free the refcount == 1 ones
>
> Here you add a third case in the page reclaim path, so why not just
> let page reclaim handle all the work of freeing pages?

As I have explained previously:
1) We would have to loop over the tail pages twice. Not much overhead,
   but unnecessary.
2) We would have to try to freeze the refcount on _every_ page in
   shrink_page_list() -- shrink_page_list() takes all pages, not just
   the relevant ones (those that were previously tail pages). Attempting
   to freeze the refcount on an irrelevant page will likely fail. Again,
   not a significant overhead, but better to avoid.

I'm not against your idea. But I'd like to hear some clarification on
the points above, that is, whether you think it's still a good idea to
do what you suggested after taking these into account.
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp 2021-08-13 23:56 ` Yu Zhao @ 2021-08-14 0:30 ` Yang Shi 2021-08-14 1:49 ` Yu Zhao 0 siblings, 1 reply; 21+ messages in thread From: Yang Shi @ 2021-08-14 0:30 UTC (permalink / raw) To: Yu Zhao Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai On Fri, Aug 13, 2021 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > () > On Fri, Aug 13, 2021 at 5:24 PM Yang Shi <shy828301@gmail.com> wrote: > > > > On Wed, Aug 11, 2021 at 4:12 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > On Wed, Aug 11, 2021 at 4:25 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > On Sun, Aug 8, 2021 at 10:49 AM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > On Wed, Aug 4, 2021 at 6:13 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > On Fri, Jul 30, 2021 at 11:39 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > > > > > If a tail page has only two references left, one inherited from the > > > > > > > isolation of its head and the other from lru_add_page_tail() which we > > > > > > > are about to drop, it means this tail page was concurrently zapped. > > > > > > > Then we can safely free it and save page reclaim or migration the > > > > > > > trouble of trying it. > > > > > > > > > > > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > > > > > > Tested-by: Shuang Zhai <zhais@google.com> > > > > > > > --- > > > > > > > include/linux/vm_event_item.h | 1 + > > > > > > > mm/huge_memory.c | 28 ++++++++++++++++++++++++++++ > > > > > > > mm/vmstat.c | 1 + > > > > > > > 3 files changed, 30 insertions(+) > > > > > > > > > > > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > > > > > > index ae0dd1948c2b..829eeac84094 100644 > > > > > > > --- a/include/linux/vm_event_item.h > > > > > > > +++ b/include/linux/vm_event_item.h > > > > > > > @@ -99,6 +99,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > THP_SPLIT_PUD, > > > > > > > #endif > > > > > > > + THP_SPLIT_FREE, > > > > > > > THP_ZERO_PAGE_ALLOC, > > > > > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > > > > > THP_SWPOUT, > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > index d8b655856e79..5120478bca41 100644 > > > > > > > --- a/mm/huge_memory.c > > > > > > > +++ b/mm/huge_memory.c > > > > > > > @@ -2432,6 +2432,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > struct address_space *swap_cache = NULL; > > > > > > > unsigned long offset = 0; > > > > > > > unsigned int nr = thp_nr_pages(head); > > > > > > > + LIST_HEAD(pages_to_free); > > > > > > > + int nr_pages_to_free = 0; > > > > > > > int i; > > > > > > > > > > > > > > VM_BUG_ON_PAGE(list && PageLRU(head), head); > > > > > > > @@ -2506,6 +2508,25 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > continue; > > > > > > > unlock_page(subpage); > > > > > > > > > > > > > > + /* > > > > > > > + * If a tail page has only two references left, one inherited > > > > > > > + * from the isolation of its head and the other from > > > > > > > + * lru_add_page_tail() which we are about to drop, it means this > > > > > > > + * tail page was concurrently zapped. Then we can safely free it > > > > > > > + * and save page reclaim or migration the trouble of trying it. 
> > > > > > > + */ > > > > > > > + if (list && page_ref_freeze(subpage, 2)) { > > > > > > > + VM_BUG_ON_PAGE(PageLRU(subpage), subpage); > > > > > > > + VM_BUG_ON_PAGE(PageCompound(subpage), subpage); > > > > > > > + VM_BUG_ON_PAGE(page_mapped(subpage), subpage); > > > > > > > + > > > > > > > + ClearPageActive(subpage); > > > > > > > + ClearPageUnevictable(subpage); > > > > > > > + list_move(&subpage->lru, &pages_to_free); > > > > > > > + nr_pages_to_free++; > > > > > > > + continue; > > > > > > > + } > > > > > > > > > > > > Yes, such page could be freed instead of swapping out. But I'm > > > > > > wondering if we could have some simpler implementation. Since such > > > > > > pages will be re-added to page list, so we should be able to check > > > > > > their refcount in shrink_page_list(). If the refcount is 1, the > > > > > > refcount inc'ed by lru_add_page_tail() has been put by later > > > > > > put_page(), we know it is freed under us since the only refcount comes > > > > > > from isolation, we could just jump to "keep" (the label in > > > > > > shrink_page_list()), then such page will be freed later by > > > > > > shrink_inactive_list(). > > > > > > > > > > > > For MADV_PAGEOUT, I think we could add some logic to handle such page > > > > > > after shrink_page_list(), just like what shrink_inactive_list() does. > > > > > > > > > > > > Migration already handles refcount == 1 page, so should not need any change. > > > > > > > > > > > > Is this idea feasible? > > > > > > > > > > Yes, but then we would have to loop over the tail pages twice, here > > > > > and in shrink_page_list(), right? > > > > > > > > I don't quite get what you mean "loop over the tail pages twice". Once > > > > THP is isolated then get split, all the tail pages will be put on the > > > > list (local list for isolated pages), then the reclaimer would deal > > > > with the head page, then continue to iterate the list to deal with > > > > tail pages. Your patch could free the tail pages earlier. But it > > > > should not make too much difference to free the tail pages a little > > > > bit later IMHO. > > > > > > We are in a (the first) loop here. If we free the tail pages later, > > > then we will need to loop over them again (the second). > > > > > > IOW, > > > 1) __split_huge_page(): for each of the 511 tail pages (first loop). > > > 2) shrink_page_list(): for each of the 511 tail pages (second loop). > > > > > > > > In addition, if we try to freeze the refcount of a page in > > > > > shrink_page_list(), we couldn't be certain whether this page used to > > > > > be a tail page. So we would have to test every page. If a page wasn't > > > > > a tail page, it's unlikely for its refcount to drop unless there is a > > > > > race. But this patch isn't really intended to optimize such a race. > > > > > It's mainly for the next, i.e., we know there is a good chance to drop > > > > > tail pages (~10% on our systems). Sounds reasonable? Thanks. > > > > > > > > I'm not sure what is the main source of the partial mapped THPs from > > > > your fleets. But if most of them are generated by MADV_DONTNEED (this > > > > is used by some userspace memory allocator libs), they should be on > > > > deferred split list too. Currently deferred split shrinker just > > > > shrinks those THPs (simply split them and free unmapped sub pages) > > > > proportionally, we definitely could shrink them more aggressively, for > > > > example, by setting shrinker->seeks to 0. I'm wondering if this will > > > > achieve a similar effect or not. 
> > > > > > Not partially mapped but internal fragmentation. > > > > > > IOW, some of the 4KB pages within a THP were never written into, which > > > can be common depending on the implementations of userspace memory > > > allocators. > > > > OK, this is actually what the patch #3 does. The patch #3 just doesn't > > remap the "all zero" page when splitting the THP IIUC. But the page > > has refcount from isolation so it can't be simply freed by put_page(). > > > > Actually this makes me think my suggestion is better. It doesn't make > > too much sense to me to have page free logic (manipulate flags, > > uncharge memcg, etc) in THP split. > > > > There have been a couple of places to handle such cases: > > - deferred split shrinker: the unmapped subpage is just freed by > > put_page() since there is no extra refcount > > - migration: check page refcount then free the refcount == 1 one > > > > Here you add the third case in the page reclaim path, so why not just > > let the page reclaim handle all the work for freeing page? > > As I have explained previously: > > 1) We would have to loop over tail pages twice. Not much overhead but > unnecessary. > 2) We would have to try to freeze the refcount on _every_ page in > shrink_page_list() -- shrink_page_list() takes all pages, not just > relevant ones (previously being tail). Attempting to freeze refcount > on an irrelevant page will likely fail. Again, not a significant > overhead but better to avoid. IIUC you don't need to freeze the refcount, such page is not in swap cache, they don't have mapping. I'm supposed you just need simply do: diff --git a/mm/vmscan.c b/mm/vmscan.c index 403a175a720f..031b98627a02 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1547,6 +1547,9 @@ static unsigned int shrink_page_list(struct list_head *page_list, * Lazyfree page could be freed directly */ if (PageAnon(page) && PageSwapBacked(page)) { + if (page_count(page) == 1) + goto locked; + if (!PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; It is an unmapped anonymous page, nobody could see it other than hwpoison handler AFAICT. > > I'm not against your idea. But I'd like to hear some clarifications > about the points above. That is whether you think it's still a good > idea to do what you suggested after taking these into account. I personally don't feel very comfortable to have the extra freeing page logic in THP split when we could leverage page reclaim code with acceptable overhead. And migration code already did so. > > > > > I really don't have any objection to free such pages, but just > > > > wondering if we could have something simpler or not. > > > > > > Thanks. 
> > > > > > > > > > + > > > > > > > /* > > > > > > > * Subpages may be freed if there wasn't any mapping > > > > > > > * like if add_to_swap() is running on a lru page that > > > > > > > @@ -2515,6 +2536,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > */ > > > > > > > put_page(subpage); > > > > > > > } > > > > > > > + > > > > > > > + if (!nr_pages_to_free) > > > > > > > + return; > > > > > > > + > > > > > > > + mem_cgroup_uncharge_list(&pages_to_free); > > > > > > > + free_unref_page_list(&pages_to_free); > > > > > > > + count_vm_events(THP_SPLIT_FREE, nr_pages_to_free); > > > > > > > } > > > > > > > > > > > > > > int total_mapcount(struct page *page) > > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > > > > > index b0534e068166..f486e5d98d96 100644 > > > > > > > --- a/mm/vmstat.c > > > > > > > +++ b/mm/vmstat.c > > > > > > > @@ -1300,6 +1300,7 @@ const char * const vmstat_text[] = { > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > "thp_split_pud", > > > > > > > #endif > > > > > > > + "thp_split_free", > > > > > > > "thp_zero_page_alloc", > > > > > > > "thp_zero_page_alloc_failed", > > > > > > > "thp_swpout", > > > > > > > -- > > > > > > > 2.32.0.554.ge1b32706d8-goog > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp 2021-08-14 0:30 ` Yang Shi @ 2021-08-14 1:49 ` Yu Zhao 2021-08-14 2:34 ` Yang Shi 0 siblings, 1 reply; 21+ messages in thread From: Yu Zhao @ 2021-08-14 1:49 UTC (permalink / raw) To: Yang Shi Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai On Fri, Aug 13, 2021 at 6:30 PM Yang Shi <shy828301@gmail.com> wrote: > > On Fri, Aug 13, 2021 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > > > () > > On Fri, Aug 13, 2021 at 5:24 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Wed, Aug 11, 2021 at 4:12 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > On Wed, Aug 11, 2021 at 4:25 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > On Sun, Aug 8, 2021 at 10:49 AM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > > > On Wed, Aug 4, 2021 at 6:13 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > On Fri, Jul 30, 2021 at 11:39 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > > > > > > > If a tail page has only two references left, one inherited from the > > > > > > > > isolation of its head and the other from lru_add_page_tail() which we > > > > > > > > are about to drop, it means this tail page was concurrently zapped. > > > > > > > > Then we can safely free it and save page reclaim or migration the > > > > > > > > trouble of trying it. > > > > > > > > > > > > > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > > > > > > > Tested-by: Shuang Zhai <zhais@google.com> > > > > > > > > --- > > > > > > > > include/linux/vm_event_item.h | 1 + > > > > > > > > mm/huge_memory.c | 28 ++++++++++++++++++++++++++++ > > > > > > > > mm/vmstat.c | 1 + > > > > > > > > 3 files changed, 30 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > > > > > > > index ae0dd1948c2b..829eeac84094 100644 > > > > > > > > --- a/include/linux/vm_event_item.h > > > > > > > > +++ b/include/linux/vm_event_item.h > > > > > > > > @@ -99,6 +99,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > > THP_SPLIT_PUD, > > > > > > > > #endif > > > > > > > > + THP_SPLIT_FREE, > > > > > > > > THP_ZERO_PAGE_ALLOC, > > > > > > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > > > > > > THP_SWPOUT, > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > > index d8b655856e79..5120478bca41 100644 > > > > > > > > --- a/mm/huge_memory.c > > > > > > > > +++ b/mm/huge_memory.c > > > > > > > > @@ -2432,6 +2432,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > struct address_space *swap_cache = NULL; > > > > > > > > unsigned long offset = 0; > > > > > > > > unsigned int nr = thp_nr_pages(head); > > > > > > > > + LIST_HEAD(pages_to_free); > > > > > > > > + int nr_pages_to_free = 0; > > > > > > > > int i; > > > > > > > > > > > > > > > > VM_BUG_ON_PAGE(list && PageLRU(head), head); > > > > > > > > @@ -2506,6 +2508,25 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > continue; > > > > > > > > unlock_page(subpage); > > > > > > > > > > > > > > > > + /* > > > > > > > > + * If a tail page has only two references left, one inherited > > > > > > > > + * from the isolation of its head and the other from > > > > > > > > + * lru_add_page_tail() which we are about to drop, it means this > > > > > > > > + 
* tail page was concurrently zapped. Then we can safely free it > > > > > > > > + * and save page reclaim or migration the trouble of trying it. > > > > > > > > + */ > > > > > > > > + if (list && page_ref_freeze(subpage, 2)) { > > > > > > > > + VM_BUG_ON_PAGE(PageLRU(subpage), subpage); > > > > > > > > + VM_BUG_ON_PAGE(PageCompound(subpage), subpage); > > > > > > > > + VM_BUG_ON_PAGE(page_mapped(subpage), subpage); > > > > > > > > + > > > > > > > > + ClearPageActive(subpage); > > > > > > > > + ClearPageUnevictable(subpage); > > > > > > > > + list_move(&subpage->lru, &pages_to_free); > > > > > > > > + nr_pages_to_free++; > > > > > > > > + continue; > > > > > > > > + } > > > > > > > > > > > > > > Yes, such page could be freed instead of swapping out. But I'm > > > > > > > wondering if we could have some simpler implementation. Since such > > > > > > > pages will be re-added to page list, so we should be able to check > > > > > > > their refcount in shrink_page_list(). If the refcount is 1, the > > > > > > > refcount inc'ed by lru_add_page_tail() has been put by later > > > > > > > put_page(), we know it is freed under us since the only refcount comes > > > > > > > from isolation, we could just jump to "keep" (the label in > > > > > > > shrink_page_list()), then such page will be freed later by > > > > > > > shrink_inactive_list(). > > > > > > > > > > > > > > For MADV_PAGEOUT, I think we could add some logic to handle such page > > > > > > > after shrink_page_list(), just like what shrink_inactive_list() does. > > > > > > > > > > > > > > Migration already handles refcount == 1 page, so should not need any change. > > > > > > > > > > > > > > Is this idea feasible? > > > > > > > > > > > > Yes, but then we would have to loop over the tail pages twice, here > > > > > > and in shrink_page_list(), right? > > > > > > > > > > I don't quite get what you mean "loop over the tail pages twice". Once > > > > > THP is isolated then get split, all the tail pages will be put on the > > > > > list (local list for isolated pages), then the reclaimer would deal > > > > > with the head page, then continue to iterate the list to deal with > > > > > tail pages. Your patch could free the tail pages earlier. But it > > > > > should not make too much difference to free the tail pages a little > > > > > bit later IMHO. > > > > > > > > We are in a (the first) loop here. If we free the tail pages later, > > > > then we will need to loop over them again (the second). > > > > > > > > IOW, > > > > 1) __split_huge_page(): for each of the 511 tail pages (first loop). > > > > 2) shrink_page_list(): for each of the 511 tail pages (second loop). > > > > > > > > > > In addition, if we try to freeze the refcount of a page in > > > > > > shrink_page_list(), we couldn't be certain whether this page used to > > > > > > be a tail page. So we would have to test every page. If a page wasn't > > > > > > a tail page, it's unlikely for its refcount to drop unless there is a > > > > > > race. But this patch isn't really intended to optimize such a race. > > > > > > It's mainly for the next, i.e., we know there is a good chance to drop > > > > > > tail pages (~10% on our systems). Sounds reasonable? Thanks. > > > > > > > > > > I'm not sure what is the main source of the partial mapped THPs from > > > > > your fleets. But if most of them are generated by MADV_DONTNEED (this > > > > > is used by some userspace memory allocator libs), they should be on > > > > > deferred split list too. 
Currently deferred split shrinker just > > > > > shrinks those THPs (simply split them and free unmapped sub pages) > > > > > proportionally, we definitely could shrink them more aggressively, for > > > > > example, by setting shrinker->seeks to 0. I'm wondering if this will > > > > > achieve a similar effect or not. > > > > > > > > Not partially mapped but internal fragmentation. > > > > > > > > IOW, some of the 4KB pages within a THP were never written into, which > > > > can be common depending on the implementations of userspace memory > > > > allocators. > > > > > > OK, this is actually what the patch #3 does. The patch #3 just doesn't > > > remap the "all zero" page when splitting the THP IIUC. But the page > > > has refcount from isolation so it can't be simply freed by put_page(). > > > > > > Actually this makes me think my suggestion is better. It doesn't make > > > too much sense to me to have page free logic (manipulate flags, > > > uncharge memcg, etc) in THP split. > > > > > > There have been a couple of places to handle such cases: > > > - deferred split shrinker: the unmapped subpage is just freed by > > > put_page() since there is no extra refcount > > > - migration: check page refcount then free the refcount == 1 one > > > > > > Here you add the third case in the page reclaim path, so why not just > > > let the page reclaim handle all the work for freeing page? > > > > As I have explained previously: > > > > 1) We would have to loop over tail pages twice. Not much overhead but > > unnecessary. > > 2) We would have to try to freeze the refcount on _every_ page in > > shrink_page_list() -- shrink_page_list() takes all pages, not just > > relevant ones (previously being tail). Attempting to freeze refcount > > on an irrelevant page will likely fail. Again, not a significant > > overhead but better to avoid. > > IIUC you don't need to freeze the refcount, such page is not in swap > cache, they don't have mapping. I'm supposed you just need simply do: Well, not really. There are speculatively page refcount increments and decrements. Specifically for what you just mentioned, those pages don't have owners anymore but GUP can't reach them. But those who use PFN to get pages can always reach them, e.g., compaction. And another similar but simpler example: static struct page *page_idle_get_page(unsigned long pfn) { struct page *page = pfn_to_online_page(pfn); if (!page || !PageLRU(page) || !get_page_unless_zero(page)) return NULL; if (unlikely(!PageLRU(page))) { put_page(page); page = NULL; } return page; } > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 403a175a720f..031b98627a02 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1547,6 +1547,9 @@ static unsigned int shrink_page_list(struct > list_head *page_list, > * Lazyfree page could be freed directly > */ > if (PageAnon(page) && PageSwapBacked(page)) { > + if (page_count(page) == 1) > + goto locked; > + > if (!PageSwapCache(page)) { > if (!(sc->gfp_mask & __GFP_IO)) > goto keep_locked; > > It is an unmapped anonymous page, nobody could see it other than > hwpoison handler AFAICT. This claim is false but the code works (if we change locked to keep_locked). When we are here, we have called 1) trylock_page() 2) page_check_references() -- _costly_ We have to call 1) unlock_page() 2) lock lru -- _costly_ 3) put_page_testzero() in move_pages_to_lru() 4) unlock lru before we reach mem_cgroup_uncharge_list() and free_unref_page_list(). These 6 extra steps are unnecessary. 
If we want to do it properly in shrink_page_list(), we should try to freeze the refcount of each page before step 1, and if successful, add this page to the free_pages list. But again, the two points I mentioned earlier are still valid. We do save a few lines of code though. > > I'm not against your idea. But I'd like to hear some clarifications > > about the points above. That is whether you think it's still a good > > idea to do what you suggested after taking these into account. > > I personally don't feel very comfortable to have the extra freeing > page logic in THP split when we could leverage page reclaim code with > acceptable overhead. And migration code already did so. I understand. And I agree that what you suggested is better for readability. I'm just listing things we may want to consider while deciding which option is more favorable. > > > > > I really don't have any objection to free such pages, but just > > > > > wondering if we could have something simpler or not. > > > > > > > > Thanks. > > > > > > > > > > > > + > > > > > > > > /* > > > > > > > > * Subpages may be freed if there wasn't any mapping > > > > > > > > * like if add_to_swap() is running on a lru page that > > > > > > > > @@ -2515,6 +2536,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > */ > > > > > > > > put_page(subpage); > > > > > > > > } > > > > > > > > + > > > > > > > > + if (!nr_pages_to_free) > > > > > > > > + return; > > > > > > > > + > > > > > > > > + mem_cgroup_uncharge_list(&pages_to_free); > > > > > > > > + free_unref_page_list(&pages_to_free); > > > > > > > > + count_vm_events(THP_SPLIT_FREE, nr_pages_to_free); > > > > > > > > } > > > > > > > > > > > > > > > > int total_mapcount(struct page *page) > > > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > > > > > > index b0534e068166..f486e5d98d96 100644 > > > > > > > > --- a/mm/vmstat.c > > > > > > > > +++ b/mm/vmstat.c > > > > > > > > @@ -1300,6 +1300,7 @@ const char * const vmstat_text[] = { > > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > > "thp_split_pud", > > > > > > > > #endif > > > > > > > > + "thp_split_free", > > > > > > > > "thp_zero_page_alloc", > > > > > > > > "thp_zero_page_alloc_failed", > > > > > > > > "thp_swpout", > > > > > > > > -- > > > > > > > > 2.32.0.554.ge1b32706d8-goog > > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
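To make the alternative concrete, the "proper" shrink_page_list() variant described above would attempt the freeze before paying for trylock_page() and page_check_references(), diverting frozen pages straight onto the local free list. Roughly (a sketch against the 5.14-era loop, not code posted in this thread):

	while (!list_empty(page_list)) {
		struct page *page;

		cond_resched();
		page = lru_to_page(page_list);
		list_del(&page->lru);

		/*
		 * Only the isolation reference is left: the page was zapped
		 * while it sat on the list. Free it without taking the page
		 * lock or walking the rmap.
		 */
		if (page_ref_freeze(page, 1)) {
			ClearPageActive(page);
			ClearPageUnevictable(page);
			list_add(&page->lru, &free_pages);
			continue;
		}

		if (!trylock_page(page))
			goto keep;
		...
	}

This avoids the six extra steps listed above, but the freeze attempt still runs on every page in the list, most of which were never tail pages, which is the second point in Yu Zhao's earlier reply.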
* Re: [PATCH 2/3] mm: free zapped tail pages when splitting isolated thp 2021-08-14 1:49 ` Yu Zhao @ 2021-08-14 2:34 ` Yang Shi 0 siblings, 0 replies; 21+ messages in thread From: Yang Shi @ 2021-08-14 2:34 UTC (permalink / raw) To: Yu Zhao Cc: Linux MM, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Zi Yan, Linux Kernel Mailing List, Shuang Zhai On Fri, Aug 13, 2021 at 6:49 PM Yu Zhao <yuzhao@google.com> wrote: > > On Fri, Aug 13, 2021 at 6:30 PM Yang Shi <shy828301@gmail.com> wrote: > > > > On Fri, Aug 13, 2021 at 4:56 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > () > > > On Fri, Aug 13, 2021 at 5:24 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > On Wed, Aug 11, 2021 at 4:12 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > On Wed, Aug 11, 2021 at 4:25 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > On Sun, Aug 8, 2021 at 10:49 AM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > > > > > On Wed, Aug 4, 2021 at 6:13 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > On Fri, Jul 30, 2021 at 11:39 PM Yu Zhao <yuzhao@google.com> wrote: > > > > > > > > > > > > > > > > > > If a tail page has only two references left, one inherited from the > > > > > > > > > isolation of its head and the other from lru_add_page_tail() which we > > > > > > > > > are about to drop, it means this tail page was concurrently zapped. > > > > > > > > > Then we can safely free it and save page reclaim or migration the > > > > > > > > > trouble of trying it. > > > > > > > > > > > > > > > > > > Signed-off-by: Yu Zhao <yuzhao@google.com> > > > > > > > > > Tested-by: Shuang Zhai <zhais@google.com> > > > > > > > > > --- > > > > > > > > > include/linux/vm_event_item.h | 1 + > > > > > > > > > mm/huge_memory.c | 28 ++++++++++++++++++++++++++++ > > > > > > > > > mm/vmstat.c | 1 + > > > > > > > > > 3 files changed, 30 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > > > > > > > > > index ae0dd1948c2b..829eeac84094 100644 > > > > > > > > > --- a/include/linux/vm_event_item.h > > > > > > > > > +++ b/include/linux/vm_event_item.h > > > > > > > > > @@ -99,6 +99,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > > > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > > > THP_SPLIT_PUD, > > > > > > > > > #endif > > > > > > > > > + THP_SPLIT_FREE, > > > > > > > > > THP_ZERO_PAGE_ALLOC, > > > > > > > > > THP_ZERO_PAGE_ALLOC_FAILED, > > > > > > > > > THP_SWPOUT, > > > > > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > > > > > > > > index d8b655856e79..5120478bca41 100644 > > > > > > > > > --- a/mm/huge_memory.c > > > > > > > > > +++ b/mm/huge_memory.c > > > > > > > > > @@ -2432,6 +2432,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > > struct address_space *swap_cache = NULL; > > > > > > > > > unsigned long offset = 0; > > > > > > > > > unsigned int nr = thp_nr_pages(head); > > > > > > > > > + LIST_HEAD(pages_to_free); > > > > > > > > > + int nr_pages_to_free = 0; > > > > > > > > > int i; > > > > > > > > > > > > > > > > > > VM_BUG_ON_PAGE(list && PageLRU(head), head); > > > > > > > > > @@ -2506,6 +2508,25 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > > continue; > > > > > > > > > unlock_page(subpage); > > > > > > > > > > > > > > > > > > + /* > > > > > > > > > + * If a tail page has only two references left, one inherited > 
> > > > > > > > + * from the isolation of its head and the other from > > > > > > > > > + * lru_add_page_tail() which we are about to drop, it means this > > > > > > > > > + * tail page was concurrently zapped. Then we can safely free it > > > > > > > > > + * and save page reclaim or migration the trouble of trying it. > > > > > > > > > + */ > > > > > > > > > + if (list && page_ref_freeze(subpage, 2)) { > > > > > > > > > + VM_BUG_ON_PAGE(PageLRU(subpage), subpage); > > > > > > > > > + VM_BUG_ON_PAGE(PageCompound(subpage), subpage); > > > > > > > > > + VM_BUG_ON_PAGE(page_mapped(subpage), subpage); > > > > > > > > > + > > > > > > > > > + ClearPageActive(subpage); > > > > > > > > > + ClearPageUnevictable(subpage); > > > > > > > > > + list_move(&subpage->lru, &pages_to_free); > > > > > > > > > + nr_pages_to_free++; > > > > > > > > > + continue; > > > > > > > > > + } > > > > > > > > > > > > > > > > Yes, such page could be freed instead of swapping out. But I'm > > > > > > > > wondering if we could have some simpler implementation. Since such > > > > > > > > pages will be re-added to page list, so we should be able to check > > > > > > > > their refcount in shrink_page_list(). If the refcount is 1, the > > > > > > > > refcount inc'ed by lru_add_page_tail() has been put by later > > > > > > > > put_page(), we know it is freed under us since the only refcount comes > > > > > > > > from isolation, we could just jump to "keep" (the label in > > > > > > > > shrink_page_list()), then such page will be freed later by > > > > > > > > shrink_inactive_list(). > > > > > > > > > > > > > > > > For MADV_PAGEOUT, I think we could add some logic to handle such page > > > > > > > > after shrink_page_list(), just like what shrink_inactive_list() does. > > > > > > > > > > > > > > > > Migration already handles refcount == 1 page, so should not need any change. > > > > > > > > > > > > > > > > Is this idea feasible? > > > > > > > > > > > > > > Yes, but then we would have to loop over the tail pages twice, here > > > > > > > and in shrink_page_list(), right? > > > > > > > > > > > > I don't quite get what you mean "loop over the tail pages twice". Once > > > > > > THP is isolated then get split, all the tail pages will be put on the > > > > > > list (local list for isolated pages), then the reclaimer would deal > > > > > > with the head page, then continue to iterate the list to deal with > > > > > > tail pages. Your patch could free the tail pages earlier. But it > > > > > > should not make too much difference to free the tail pages a little > > > > > > bit later IMHO. > > > > > > > > > > We are in a (the first) loop here. If we free the tail pages later, > > > > > then we will need to loop over them again (the second). > > > > > > > > > > IOW, > > > > > 1) __split_huge_page(): for each of the 511 tail pages (first loop). > > > > > 2) shrink_page_list(): for each of the 511 tail pages (second loop). > > > > > > > > > > > > In addition, if we try to freeze the refcount of a page in > > > > > > > shrink_page_list(), we couldn't be certain whether this page used to > > > > > > > be a tail page. So we would have to test every page. If a page wasn't > > > > > > > a tail page, it's unlikely for its refcount to drop unless there is a > > > > > > > race. But this patch isn't really intended to optimize such a race. > > > > > > > It's mainly for the next, i.e., we know there is a good chance to drop > > > > > > > tail pages (~10% on our systems). Sounds reasonable? Thanks. 
> > > > > > > > > > > > I'm not sure what is the main source of the partial mapped THPs from > > > > > > your fleets. But if most of them are generated by MADV_DONTNEED (this > > > > > > is used by some userspace memory allocator libs), they should be on > > > > > > deferred split list too. Currently deferred split shrinker just > > > > > > shrinks those THPs (simply split them and free unmapped sub pages) > > > > > > proportionally, we definitely could shrink them more aggressively, for > > > > > > example, by setting shrinker->seeks to 0. I'm wondering if this will > > > > > > achieve a similar effect or not. > > > > > > > > > > Not partially mapped but internal fragmentation. > > > > > > > > > > IOW, some of the 4KB pages within a THP were never written into, which > > > > > can be common depending on the implementations of userspace memory > > > > > allocators. > > > > > > > > OK, this is actually what the patch #3 does. The patch #3 just doesn't > > > > remap the "all zero" page when splitting the THP IIUC. But the page > > > > has refcount from isolation so it can't be simply freed by put_page(). > > > > > > > > Actually this makes me think my suggestion is better. It doesn't make > > > > too much sense to me to have page free logic (manipulate flags, > > > > uncharge memcg, etc) in THP split. > > > > > > > > There have been a couple of places to handle such cases: > > > > - deferred split shrinker: the unmapped subpage is just freed by > > > > put_page() since there is no extra refcount > > > > - migration: check page refcount then free the refcount == 1 one > > > > > > > > Here you add the third case in the page reclaim path, so why not just > > > > let the page reclaim handle all the work for freeing page? > > > > > > As I have explained previously: > > > > > > 1) We would have to loop over tail pages twice. Not much overhead but > > > unnecessary. > > > 2) We would have to try to freeze the refcount on _every_ page in > > > shrink_page_list() -- shrink_page_list() takes all pages, not just > > > relevant ones (previously being tail). Attempting to freeze refcount > > > on an irrelevant page will likely fail. Again, not a significant > > > overhead but better to avoid. > > > > IIUC you don't need to freeze the refcount, such page is not in swap > > cache, they don't have mapping. I'm supposed you just need simply do: > > Well, not really. There are speculatively page refcount increments and > decrements. Specifically for what you just mentioned, those pages > don't have owners anymore but GUP can't reach them. But those who use > PFN to get pages can always reach them, e.g., compaction. And another > similar but simpler example: Yes, all the paths which traverse page via PFN could reach the page and may inc/dec refcount, some require the page on LRU (e.g. compaction), some don't (e.g. hwpoison). The page is off LRU and such refcount race should be harmless. 
> > static struct page *page_idle_get_page(unsigned long pfn) > { > struct page *page = pfn_to_online_page(pfn); > > if (!page || !PageLRU(page) || > !get_page_unless_zero(page)) > return NULL; > > if (unlikely(!PageLRU(page))) { > put_page(page); > page = NULL; > } > return page; > } > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 403a175a720f..031b98627a02 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1547,6 +1547,9 @@ static unsigned int shrink_page_list(struct > > list_head *page_list, > > * Lazyfree page could be freed directly > > */ > > if (PageAnon(page) && PageSwapBacked(page)) { > > + if (page_count(page) == 1) > > + goto locked; > > + > > if (!PageSwapCache(page)) { > > if (!(sc->gfp_mask & __GFP_IO)) > > goto keep_locked; > > > > It is an unmapped anonymous page, nobody could see it other than > > hwpoison handler AFAICT. > > This claim is false but the code works (if we change locked to keep_locked). Yeah, keep_locked. I thought keep_locked, but typed locked. > > When we are here, we have called > 1) trylock_page() > 2) page_check_references() -- _costly_ > > We have to call > 1) unlock_page() > 2) lock lru -- _costly_ > 3) put_page_testzero() in move_pages_to_lru() > 4) unlock lru > before we reach mem_cgroup_uncharge_list() and free_unref_page_list(). > > These 6 extra steps are unnecessary. If we want to do it properly in > shrink_page_list(), we should try to freeze the refcount of each page > before step 1, and if successful, add this page to the free_pages > list. But again, the two points I mentioned earlier are still valid. > We do save a few lines of code though. You definitely could move the check earlier or freeze the refcount. There are a couple of different ways, that example code is just for illustration. > > > > I'm not against your idea. But I'd like to hear some clarifications > > > about the points above. That is whether you think it's still a good > > > idea to do what you suggested after taking these into account. > > > > I personally don't feel very comfortable to have the extra freeing > > page logic in THP split when we could leverage page reclaim code with > > acceptable overhead. And migration code already did so. > > I understand. And I agree that what you suggested is better for > readability. I'm just listing things we may want to consider while > deciding which option is more favorable. > > > > > > > I really don't have any objection to free such pages, but just > > > > > > wondering if we could have something simpler or not. > > > > > > > > > > Thanks. 
> > > > > > > > > > > > > > + > > > > > > > > > /* > > > > > > > > > * Subpages may be freed if there wasn't any mapping > > > > > > > > > * like if add_to_swap() is running on a lru page that > > > > > > > > > @@ -2515,6 +2536,13 @@ static void __split_huge_page(struct page *page, struct list_head *list, > > > > > > > > > */ > > > > > > > > > put_page(subpage); > > > > > > > > > } > > > > > > > > > + > > > > > > > > > + if (!nr_pages_to_free) > > > > > > > > > + return; > > > > > > > > > + > > > > > > > > > + mem_cgroup_uncharge_list(&pages_to_free); > > > > > > > > > + free_unref_page_list(&pages_to_free); > > > > > > > > > + count_vm_events(THP_SPLIT_FREE, nr_pages_to_free); > > > > > > > > > } > > > > > > > > > > > > > > > > > > int total_mapcount(struct page *page) > > > > > > > > > diff --git a/mm/vmstat.c b/mm/vmstat.c > > > > > > > > > index b0534e068166..f486e5d98d96 100644 > > > > > > > > > --- a/mm/vmstat.c > > > > > > > > > +++ b/mm/vmstat.c > > > > > > > > > @@ -1300,6 +1300,7 @@ const char * const vmstat_text[] = { > > > > > > > > > #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD > > > > > > > > > "thp_split_pud", > > > > > > > > > #endif > > > > > > > > > + "thp_split_free", > > > > > > > > > "thp_zero_page_alloc", > > > > > > > > > "thp_zero_page_alloc_failed", > > > > > > > > > "thp_swpout", > > > > > > > > > -- > > > > > > > > > 2.32.0.554.ge1b32706d8-goog > > > > > > > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 [PATCH 0/3] mm: optimize thp for reclaim and migration Yu Zhao 2021-07-31 6:39 ` [PATCH 1/3] mm: don't take lru lock when splitting isolated thp Yu Zhao 2021-07-31 6:39 ` [PATCH 2/3] mm: free zapped tail pages " Yu Zhao @ 2021-07-31 6:39 ` Yu Zhao 2021-07-31 9:53 ` kernel test robot ` (4 more replies) 2 siblings, 5 replies; 21+ messages in thread From: Yu Zhao @ 2021-07-31 6:39 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Yu Zhao, Shuang Zhai Here being clean means containing only zeros and inaccessible to userspace. When splitting an isolated thp under reclaim or migration, there is no need to remap its clean subpages because they can be faulted in anew. Not remapping them avoids writeback or copying during reclaim or migration. This is particularly helpful when the internal fragmentation of a thp is high, i.e., it has many untouched subpages. Signed-off-by: Yu Zhao <yuzhao@google.com> Tested-by: Shuang Zhai <zhais@google.com> --- include/linux/rmap.h | 2 +- include/linux/vm_event_item.h | 1 + mm/huge_memory.c | 10 +++--- mm/migrate.c | 65 ++++++++++++++++++++++++++++++----- mm/vmstat.c | 1 + 5 files changed, 64 insertions(+), 15 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c976cc6de257..4c2789d3fafb 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -243,7 +243,7 @@ int page_mkclean(struct page *); */ void page_mlock(struct page *page); -void remove_migration_ptes(struct page *old, struct page *new, bool locked); +void remove_migration_ptes(struct page *old, struct page *new, bool locked, bool unmap_clean); /* * Called by memory-failure.c to kill processes. 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 829eeac84094..a08fa70d8026 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -100,6 +100,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_SPLIT_PUD, #endif THP_SPLIT_FREE, + THP_SPLIT_UNMAP, THP_ZERO_PAGE_ALLOC, THP_ZERO_PAGE_ALLOC_FAILED, THP_SWPOUT, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5120478bca41..1653b84dc800 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2327,7 +2327,7 @@ static void unmap_page(struct page *page) VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } -static void remap_page(struct page *page, unsigned int nr) +static void remap_page(struct page *page, unsigned int nr, bool unmap_clean) { int i; @@ -2335,10 +2335,10 @@ static void remap_page(struct page *page, unsigned int nr) if (!PageAnon(page)) return; if (PageTransHuge(page)) { - remove_migration_ptes(page, page, true); + remove_migration_ptes(page, page, true, false); } else { for (i = 0; i < nr; i++) - remove_migration_ptes(page + i, page + i, true); + remove_migration_ptes(page + i, page + i, true, unmap_clean); } } @@ -2494,7 +2494,7 @@ static void __split_huge_page(struct page *page, struct list_head *list, } local_irq_enable(); - remap_page(head, nr); + remap_page(head, nr, !!list); if (PageSwapCache(head)) { swp_entry_t entry = { .val = page_private(head) }; @@ -2769,7 +2769,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (mapping) xa_unlock(&mapping->i_pages); local_irq_enable(); - remap_page(head, thp_nr_pages(head)); + remap_page(head, thp_nr_pages(head), false); ret = -EBUSY; } diff --git a/mm/migrate.c b/mm/migrate.c index 7e240437e7d9..46c2a54c2ac1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -168,14 +168,53 @@ void putback_movable_pages(struct list_head *l) } } +static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page) +{ + void *addr; + bool dirty; + + VM_BUG_ON_PAGE(PageLRU(page), page); + VM_BUG_ON_PAGE(PageCompound(page), page); + VM_BUG_ON_PAGE(!PageAnon(page), page); + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); + + if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED)) + return false; + + /* + * The pmd entry mapping the old thp was flushed and the pte mapping + * this subpage has been non present. Therefore, this subpage is + * inaccessible. We don't need to remap it if it contains only zeros. 
+ */ + addr = kmap_atomic(page); + dirty = !!memchr_inv(addr, 0, PAGE_SIZE); + kunmap_atomic(addr); + + if (dirty) + return false; + + pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false); + dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page)); + count_vm_event(THP_SPLIT_UNMAP); + + return true; +} + +struct rmap_walk_arg { + struct page *page; + bool unmap_clean; +}; + /* * Restore a potential migration pte to a working pte entry */ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, - unsigned long addr, void *old) + unsigned long addr, void *arg) { + struct rmap_walk_arg *rmap_walk_arg = arg; struct page_vma_mapped_walk pvmw = { - .page = old, + .page = rmap_walk_arg->page, .vma = vma, .address = addr, .flags = PVMW_SYNC | PVMW_MIGRATION, @@ -200,6 +239,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, continue; } #endif + if (rmap_walk_arg->unmap_clean && try_to_unmap_clean(&pvmw, new)) + continue; get_page(new); pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot))); @@ -267,13 +308,19 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, * Get rid of all migration entries and replace them by * references to the indicated page. */ -void remove_migration_ptes(struct page *old, struct page *new, bool locked) +void remove_migration_ptes(struct page *old, struct page *new, bool locked, bool unmap_clean) { + struct rmap_walk_arg rmap_walk_arg = { + .page = old, + .unmap_clean = unmap_clean, + }; struct rmap_walk_control rwc = { .rmap_one = remove_migration_pte, - .arg = old, + .arg = &rmap_walk_arg, }; + VM_BUG_ON_PAGE(unmap_clean && old != new, old); + if (locked) rmap_walk_locked(new, &rwc); else @@ -827,7 +874,7 @@ static int writeout(struct address_space *mapping, struct page *page) * At this point we know that the migration attempt cannot * be successful. */ - remove_migration_ptes(page, page, false); + remove_migration_ptes(page, page, false, false); rc = mapping->a_ops->writepage(page, &wbc); @@ -1070,7 +1117,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, if (page_was_mapped) remove_migration_ptes(page, - rc == MIGRATEPAGE_SUCCESS ? newpage : page, false); + rc == MIGRATEPAGE_SUCCESS ? newpage : page, false, false); out_unlock_both: unlock_page(newpage); @@ -1292,7 +1339,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, if (page_was_mapped) remove_migration_ptes(hpage, - rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false); + rc == MIGRATEPAGE_SUCCESS ? 
new_hpage : hpage, false, false); unlock_put_anon: unlock_page(new_hpage); @@ -2585,7 +2632,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue; - remove_migration_ptes(page, page, false); + remove_migration_ptes(page, page, false, false); migrate->src[i] = 0; unlock_page(page); @@ -2963,7 +3010,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate) newpage = page; } - remove_migration_ptes(page, newpage, false); + remove_migration_ptes(page, newpage, false, false); unlock_page(page); if (is_zone_device_page(page)) diff --git a/mm/vmstat.c b/mm/vmstat.c index f486e5d98d96..a83cbb6a4d70 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1301,6 +1301,7 @@ const char * const vmstat_text[] = { "thp_split_pud", #endif "thp_split_free", + "thp_split_unmap", "thp_zero_page_alloc", "thp_zero_page_alloc_failed", "thp_swpout", -- 2.32.0.554.ge1b32706d8-goog ^ permalink raw reply [flat|nested] 21+ messages in thread
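For reference, the zero check in try_to_unmap_clean() above relies on memchr_inv(), which returns the address of the first byte in the range that differs from the given pattern, or NULL when every byte matches:

	char buf[8] = { 0 };

	memchr_inv(buf, 0, sizeof(buf));	/* NULL: all bytes are zero */
	buf[5] = 1;
	memchr_inv(buf, 0, sizeof(buf));	/* &buf[5]: first non-zero byte */

So "dirty" is true exactly when the subpage contains at least one non-zero byte, in which case it must be remapped as usual.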
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao @ 2021-07-31 9:53 ` kernel test robot 2021-07-31 15:45 ` kernel test robot ` (3 subsequent siblings) 4 siblings, 0 replies; 21+ messages in thread From: kernel test robot @ 2021-07-31 9:53 UTC (permalink / raw) To: Yu Zhao, linux-mm Cc: clang-built-linux, kbuild-all, Andrew Morton, Linux Memory Management List, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3311 bytes --] Hi Yu, Thank you for the patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v5.14-rc3] [cannot apply to hnaz-linux-mm/master linux/master next-20210730] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Yu-Zhao/mm-optimize-thp-for-reclaim-and-migration/20210731-144129 base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git c7d102232649226a69dddd58a4942cf13cff4f7c config: x86_64-randconfig-a001-20210730 (attached as .config) compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 4f71f59bf3d9914188a11d0c41bedbb339d36ff5) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/e4e76c4915b364558aacae2cf320a43306a20fa1 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Yu-Zhao/mm-optimize-thp-for-reclaim-and-migration/20210731-144129 git checkout e4e76c4915b364558aacae2cf320a43306a20fa1 # save the attached .config to linux build tree mkdir build_dir COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross O=build_dir ARCH=x86_64 SHELL=/bin/bash If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> mm/migrate.c:199:17: error: use of undeclared identifier 'THP_SPLIT_UNMAP' count_vm_event(THP_SPLIT_UNMAP); ^ mm/migrate.c:2606:16: warning: variable 'addr' set but not used [-Wunused-but-set-variable] unsigned long addr, i, restore = 0; ^ 1 warning and 1 error generated. vim +/THP_SPLIT_UNMAP +199 mm/migrate.c 170 171 static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page) 172 { 173 void *addr; 174 bool dirty; 175 176 VM_BUG_ON_PAGE(PageLRU(page), page); 177 VM_BUG_ON_PAGE(PageCompound(page), page); 178 VM_BUG_ON_PAGE(!PageAnon(page), page); 179 VM_BUG_ON_PAGE(!PageLocked(page), page); 180 VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); 181 182 if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED)) 183 return false; 184 185 /* 186 * The pmd entry mapping the old thp was flushed and the pte mapping 187 * this subpage has been non present. Therefore, this subpage is 188 * inaccessible. We don't need to remap it if it contains only zeros. 
189 */ 190 addr = kmap_atomic(page); 191 dirty = !!memchr_inv(addr, 0, PAGE_SIZE); 192 kunmap_atomic(addr); 193 194 if (dirty) 195 return false; 196 197 pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false); 198 dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page)); > 199 count_vm_event(THP_SPLIT_UNMAP); 200 201 return true; 202 } 203 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 34821 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao 2021-07-31 9:53 ` kernel test robot @ 2021-07-31 15:45 ` kernel test robot 2021-08-03 11:25 ` Matthew Wilcox ` (2 subsequent siblings) 4 siblings, 0 replies; 21+ messages in thread From: kernel test robot @ 2021-07-31 15:45 UTC (permalink / raw) To: Yu Zhao, linux-mm Cc: kbuild-all, Andrew Morton, Linux Memory Management List, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3045 bytes --] Hi Yu, Thank you for the patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v5.14-rc3] [cannot apply to hnaz-linux-mm/master linux/master next-20210730] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Yu-Zhao/mm-optimize-thp-for-reclaim-and-migration/20210731-144129 base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git c7d102232649226a69dddd58a4942cf13cff4f7c config: i386-randconfig-a003-20210730 (attached as .config) compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 reproduce (this is a W=1 build): # https://github.com/0day-ci/linux/commit/e4e76c4915b364558aacae2cf320a43306a20fa1 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Yu-Zhao/mm-optimize-thp-for-reclaim-and-migration/20210731-144129 git checkout e4e76c4915b364558aacae2cf320a43306a20fa1 # save the attached .config to linux build tree mkdir build_dir make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): mm/migrate.c: In function 'try_to_unmap_clean': >> mm/migrate.c:199:17: error: 'THP_SPLIT_UNMAP' undeclared (first use in this function) 199 | count_vm_event(THP_SPLIT_UNMAP); | ^~~~~~~~~~~~~~~ mm/migrate.c:199:17: note: each undeclared identifier is reported only once for each function it appears in vim +/THP_SPLIT_UNMAP +199 mm/migrate.c 170 171 static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page) 172 { 173 void *addr; 174 bool dirty; 175 176 VM_BUG_ON_PAGE(PageLRU(page), page); 177 VM_BUG_ON_PAGE(PageCompound(page), page); 178 VM_BUG_ON_PAGE(!PageAnon(page), page); 179 VM_BUG_ON_PAGE(!PageLocked(page), page); 180 VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); 181 182 if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED)) 183 return false; 184 185 /* 186 * The pmd entry mapping the old thp was flushed and the pte mapping 187 * this subpage has been non present. Therefore, this subpage is 188 * inaccessible. We don't need to remap it if it contains only zeros. 
189 */ 190 addr = kmap_atomic(page); 191 dirty = !!memchr_inv(addr, 0, PAGE_SIZE); 192 kunmap_atomic(addr); 193 194 if (dirty) 195 return false; 196 197 pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false); 198 dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page)); > 199 count_vm_event(THP_SPLIT_UNMAP); 200 201 return true; 202 } 203 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 33350 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
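Both robot reports are the same failure on randconfigs that most likely have CONFIG_TRANSPARENT_HUGEPAGE disabled: the THP_* entries of enum vm_event_item only exist under that option, while mm/migrate.c is built unconditionally. A minimal compile fix, assuming that diagnosis (this hunk was not posted in the thread), would guard the counter bump:

	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page));
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* THP_SPLIT_UNMAP is undefined when THP support is compiled out. */
	count_vm_event(THP_SPLIT_UNMAP);
#endif

try_to_unmap_clean() is only reachable from a THP split, so the event can never fire in the THP=n case anyway.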
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao 2021-07-31 9:53 ` kernel test robot 2021-07-31 15:45 ` kernel test robot @ 2021-08-03 11:25 ` Matthew Wilcox 2021-08-03 11:36 ` Matthew Wilcox 2021-08-04 14:27 ` Kirill A. Shutemov 4 siblings, 0 replies; 21+ messages in thread From: Matthew Wilcox @ 2021-08-03 11:25 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai On Sat, Jul 31, 2021 at 12:39:38AM -0600, Yu Zhao wrote: > +++ b/mm/migrate.c > @@ -168,14 +168,53 @@ void putback_movable_pages(struct list_head *l) > } > } > > +static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page) > +{ > + void *addr; > + bool dirty; > + > + VM_BUG_ON_PAGE(PageLRU(page), page); > + VM_BUG_ON_PAGE(PageCompound(page), page); > + VM_BUG_ON_PAGE(!PageAnon(page), page); > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page); > + > + if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED)) > + return false; > + > + /* > + * The pmd entry mapping the old thp was flushed and the pte mapping > + * this subpage has been non present. Therefore, this subpage is > + * inaccessible. We don't need to remap it if it contains only zeros. > + */ > + addr = kmap_atomic(page); > + dirty = !!memchr_inv(addr, 0, PAGE_SIZE); > + kunmap_atomic(addr); kmap_local() is preferred now. Also, Linus has a particular hatred for the !! idiom; just compare against NULL. ^ permalink raw reply [flat|nested] 21+ messages in thread
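Applying both nits, the hunk would read roughly as follows (kmap_local_page() being the page-taking form of the kmap_local API Willy refers to; a sketch, not a posted revision):

	addr = kmap_local_page(page);
	dirty = memchr_inv(addr, 0, PAGE_SIZE) != NULL;
	kunmap_local(addr);

Unlike kmap_atomic(), kmap_local_page() does not disable preemption, and the explicit != NULL comparison replaces the !! idiom.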
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao ` (2 preceding siblings ...) 2021-08-03 11:25 ` Matthew Wilcox @ 2021-08-03 11:36 ` Matthew Wilcox 2021-08-08 17:21 ` Yu Zhao 2021-08-04 14:27 ` Kirill A. Shutemov 4 siblings, 1 reply; 21+ messages in thread From: Matthew Wilcox @ 2021-08-03 11:36 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai On Sat, Jul 31, 2021 at 12:39:38AM -0600, Yu Zhao wrote: > +++ b/include/linux/rmap.h > @@ -243,7 +243,7 @@ int page_mkclean(struct page *); > */ > void page_mlock(struct page *page); > > -void remove_migration_ptes(struct page *old, struct page *new, bool locked); > +void remove_migration_ptes(struct page *old, struct page *new, bool locked, bool unmap_clean); I'm not a big fan of 'bool, bool'. Could we use a flag word instead? #define MIGRATE_REMOVE_LOCKED 1 #define MIGRATE_UNMAP_CLEAN 2 ^ permalink raw reply [flat|nested] 21+ messages in thread
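Under that suggestion the two booleans collapse into a single flags word built from Willy's strawman constants; the signature and a call site would then read roughly (a sketch, not a posted patch):

void remove_migration_ptes(struct page *old, struct page *new, int flags);

	/* e.g. from remap_page(), when splitting under reclaim or migration: */
	remove_migration_ptes(page + i, page + i,
			      MIGRATE_REMOVE_LOCKED | MIGRATE_UNMAP_CLEAN);

This also leaves room for future callers without another signature change.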
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-08-03 11:36 ` Matthew Wilcox @ 2021-08-08 17:21 ` Yu Zhao 0 siblings, 0 replies; 21+ messages in thread From: Yu Zhao @ 2021-08-08 17:21 UTC (permalink / raw) To: Matthew Wilcox Cc: Linux-MM, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai On Tue, Aug 3, 2021 at 5:38 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Sat, Jul 31, 2021 at 12:39:38AM -0600, Yu Zhao wrote: > > +++ b/include/linux/rmap.h > > @@ -243,7 +243,7 @@ int page_mkclean(struct page *); > > */ > > void page_mlock(struct page *page); > > > > -void remove_migration_ptes(struct page *old, struct page *new, bool locked); > > +void remove_migration_ptes(struct page *old, struct page *new, bool locked, bool unmap_clean); > > I'm not a big fan of 'bool, bool'. Could we use a flag word instead? > > #define MIGRATE_REMOVE_LOCKED 1 > #define MIGRATE_UNMAP_CLEAN 2 Will do. Thanks. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 3/3] mm: don't remap clean subpages when splitting isolated thp 2021-07-31 6:39 ` [PATCH 3/3] mm: don't remap clean subpages " Yu Zhao ` (3 preceding siblings ...) 2021-08-03 11:36 ` Matthew Wilcox @ 2021-08-04 14:27 ` Kirill A. Shutemov 4 siblings, 0 replies; 21+ messages in thread From: Kirill A. Shutemov @ 2021-08-04 14:27 UTC (permalink / raw) To: Yu Zhao Cc: linux-mm, Andrew Morton, Hugh Dickins, Kirill A . Shutemov, Matthew Wilcox, Vlastimil Babka, Yang Shi, Zi Yan, linux-kernel, Shuang Zhai On Sat, Jul 31, 2021 at 12:39:38AM -0600, Yu Zhao wrote: > @@ -267,13 +308,19 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, > * Get rid of all migration entries and replace them by > * references to the indicated page. > */ > -void remove_migration_ptes(struct page *old, struct page *new, bool locked) > +void remove_migration_ptes(struct page *old, struct page *new, bool locked, bool unmap_clean) > { > + struct rmap_walk_arg rmap_walk_arg = { > + .page = old, > + .unmap_clean = unmap_clean, > + }; > struct rmap_walk_control rwc = { > .rmap_one = remove_migration_pte, > - .arg = old, > + .arg = &rmap_walk_arg, > }; 'old' pointer has few low always-zero bit :P -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 21+ messages in thread
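Kirill's hint: struct page pointers are at least word-aligned, so the low bits of 'old' are always zero and the unmap_clean flag can ride in one of them, sparing the on-stack rmap_walk_arg entirely. Roughly (illustrative only, with a made-up constant name; not a posted revision):

#define RMP_UNMAP_CLEAN	0x1UL

	/* pack: the low bit of a properly aligned struct page * is zero */
	rwc.arg = (void *)((unsigned long)old |
			   (unmap_clean ? RMP_UNMAP_CLEAN : 0));

	/* unpack inside remove_migration_pte(): */
	struct page *old = (struct page *)((unsigned long)arg & ~RMP_UNMAP_CLEAN);
	bool unmap_clean = (unsigned long)arg & RMP_UNMAP_CLEAN;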