From: Johannes Weiner <hannes@cmpxchg.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Linux-MM <linux-mm@kvack.org>, Yu Zhao <yuzhao@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Muchun Song <muchun.song@linux.dev>
Subject: Re: Hugepage program taking forever to exit
Date: Tue, 10 Sep 2024 15:33:42 -0400
Message-ID: <20240910193342.GA108220@cmpxchg.org>
In-Reply-To: <02ffa542-ce49-4755-9d2b-29841f9973e0@kernel.dk>
On Tue, Sep 10, 2024 at 12:21:42PM -0600, Jens Axboe wrote:
> Hi,
>
> Investigating another issue, I wrote the following simple program that allocates
> and faults in 500 1GB huge pages, and then registers them with io_uring. Each
> step is timed:
>
> Got 500 huge pages (each 1024MB) in 0 msec
> Faulted in 500 huge pages in 38632 msec
> Registered 500 pages in 867 msec
>
> and as expected, faulting in the pages takes (by far) the longest. From
> the above, you'd also expect the total runtime to be around ~39 seconds.
> But it is not... In fact it takes 82 seconds in total for this program
> to have exited. Looking at why, I see:
>
> [<0>] __wait_rcu_gp+0x12b/0x160
> [<0>] synchronize_rcu_normal.part.0+0x2a/0x30
> [<0>] hugetlb_vmemmap_restore_folios+0x22/0xe0
> [<0>] update_and_free_pages_bulk+0x4c/0x220
> [<0>] return_unused_surplus_pages+0x80/0xa0
> [<0>] hugetlb_acct_memory.part.0+0x2dd/0x3b0
> [<0>] hugetlb_vm_op_close+0x160/0x180
> [<0>] remove_vma+0x20/0x60
> [<0>] exit_mmap+0x199/0x340
> [<0>] mmput+0x49/0x110
> [<0>] do_exit+0x261/0x9b0
> [<0>] do_group_exit+0x2c/0x80
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] x64_sys_call+0x714/0x720
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Yeah, this looks wrong to me:

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);

	list_for_each_entry(folio, folio_list, lru) {
		int ret = hugetlb_vmemmap_split_folio(h, folio);

		/*
		 * Splitting the PMD requires allocating a page, thus let's fail
		 * early once we encounter the first OOM. No point in retrying
		 * as it can be dynamically done on remap with the memory
		 * we get back from the vmemmap deduplication.
		 */
		if (ret == -ENOMEM)
			break;
	}

	flush_tlb_all();

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	list_for_each_entry(folio, folio_list, lru) {
		int ret;

		ret = __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
						       VMEMMAP_REMAP_NO_TLB_FLUSH);

		/*
		 * Pages to be freed may have been accumulated. If we
		 * encounter an ENOMEM, free what we have and try again.
		 * This can occur in the case that both splitting fails
		 * halfway and head page allocation also failed. In this
		 * case __hugetlb_vmemmap_optimize_folio() would free memory
		 * allowing more vmemmap remaps to occur.
		 */
		if (ret == -ENOMEM && !list_empty(&vmemmap_pages)) {
			flush_tlb_all();
			free_vmemmap_page_list(&vmemmap_pages);
			INIT_LIST_HEAD(&vmemmap_pages);
			__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages,
							 VMEMMAP_REMAP_NO_TLB_FLUSH);
		}
	}

	flush_tlb_all();
	free_vmemmap_page_list(&vmemmap_pages);
}

If you don't have HVO enabled, then hugetlb_vmemmap_split_folio() does
nothing, and __hugetlb_vmemmap_optimize_folio() does nothing either,
leaving &vmemmap_pages empty and free_vmemmap_page_list() a nop.

So all that's left is: it flushes the TLB twice and waits for an RCU
grace period. What for, exactly?
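
Something like the below is roughly what I have in mind for the bulk
path. This is just an untested sketch; vmemmap_should_optimize_folio()
is my stand-in for whatever the right "will HVO actually touch these
folios" predicate is, so take the names with a grain of salt:

void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
	struct folio *folio;
	LIST_HEAD(vmemmap_pages);
	bool optimizable = false;

	/* Find out up front whether HVO will remap anything at all. */
	list_for_each_entry(folio, folio_list, lru) {
		if (vmemmap_should_optimize_folio(h, folio)) {
			optimizable = true;
			break;
		}
	}

	/* HVO disabled or not applicable: nothing to flush or wait for. */
	if (!optimizable)
		return;

	/* ... existing split / flush_tlb_all() / synchronize_rcu() / remap loops ... */
}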

The same is true for the single-folio paths, hugetlb_vmemmap_optimize_folio()
and hugetlb_vmemmap_restore_folio(), which wait for RCU on every huge
page being allocated and freed, even if the vmemmap is left alone.
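
Same idea for that case, again just an untested sketch with the same
assumption about the predicate name, and from memory of what the current
body looks like; the only change proposed is bailing out before the
grace period:

void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
{
	LIST_HEAD(vmemmap_pages);

	/* Bail before the grace period if HVO won't remap this folio anyway. */
	if (!vmemmap_should_optimize_folio(h, folio))
		return;

	/* avoid writes from page_ref_add_unless() while folding vmemmap */
	synchronize_rcu();

	__hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
	free_vmemmap_page_list(&vmemmap_pages);
}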

Surely all those RCU waits and TLB flushes should be guarded by whether
HVO is actually enabled, no?