From: Waiman Long <longman@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] hugetlbfs: Disable IRQ when taking hugetlb_lock in set_max_huge_pages()
Date: Mon, 9 Dec 2019 21:37:00 -0500 [thread overview]
Message-ID: <1209d9ba-9d82-4dfc-5cdf-a2641814af75@redhat.com> (raw)
In-Reply-To: <a7ea9e1a-be9e-e6ee-5b30-602166041509@redhat.com>
On 12/9/19 7:46 PM, Waiman Long wrote:
> On 12/9/19 11:49 AM, Matthew Wilcox wrote:
>> On Mon, Dec 09, 2019 at 11:01:50AM -0500, Waiman Long wrote:
>>> [ 613.245273] Call Trace:
>>> [ 613.256273] <IRQ>
>>> [ 613.265273] dump_stack+0x9a/0xf0
>>> [ 613.281273] mark_lock+0xd0c/0x12f0
>>> [ 613.341273] __lock_acquire+0x146b/0x48c0
>>> [ 613.401273] lock_acquire+0x14f/0x3b0
>>> [ 613.440273] _raw_spin_lock+0x30/0x70
>>> [ 613.477273] free_huge_page+0x36f/0xaa0
>>> [ 613.495273] bio_check_pages_dirty+0x2fc/0x5c0
>> Oh, this is fun. So we kicked off IO to a hugepage, then truncated or
>> otherwise caused the page to come free. Then the IO finished and did the
>> final put on the page ... from interrupt context. Splat. Not something
>> that's going to happen often, but can happen if a process dies during
>> IO or due to a userspace bug.
>>
>> Maybe we should schedule work to do the actual freeing of the page
>> instead of this rather large patch. It doesn't seem like a case we need
>> to optimise for.
> I think it may be a good idea to try that out.
It turns out that using a workqueue is more complex than I originally
thought. I have come up with the following untested changes to do that:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..629ac000318b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
page[2].mapping = NULL;
}
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
{
/*
* Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,82 @@ void free_huge_page(struct page *page)
spin_unlock(&hugetlb_lock);
}
+/*
+ * As free_huge_page() can be called from softIRQ context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ */
+struct hpage_to_free {
+ struct page *page;
+ struct hpage_to_free *next;
+};
+static struct hpage_to_free *hpage_freelist;
+#define NEXT_PENDING ((struct hpage_to_free *)-1)
+
+/*
+ * This work function locklessly retrieves the pages to be freed and
+ * frees them one-by-one.
+ */
+static void free_huge_page_workfn(struct work_struct *work)
+{
+ struct hpage_to_free *curr, *next;
+
+recheck:
+ curr = xchg(&hpage_freelist, NULL);
+ if (!curr)
+ return;
+
+ while (curr) {
+ __free_huge_page(curr->page);
+ next = READ_ONCE(curr->next);
+ while (next == NEXT_PENDING) {
+ cpu_relax();
+ next = READ_ONCE(curr->next);
+ }
+ kfree(curr);
+ curr = next;
+ }
+ if (work && READ_ONCE(hpage_freelist))
+ goto recheck;
+}
+static DECLARE_WORK(free_huge_page_work, free_huge_page_workfn);
+
+void free_huge_page(struct page *page)
+{
+ /*
+ * Defer freeing in softIRQ context to avoid hugetlb_lock deadlock.
+ */
+ if (in_serving_softirq()) {
+ struct hpage_to_free *item, *next;
+
+ /*
+ * We are in serious trouble if kmalloc fails. In this
+ * case, we hope for the best and do the freeing now.
+ */
+ item = kmalloc(sizeof(*item), GFP_ATOMIC);
+ if (WARN_ON_ONCE(!item))
+ goto free_page_now;
+
+ item->page = page;
+ item->next = NEXT_PENDING;
+ next = xchg(&hpage_freelist, item);
+ WRITE_ONCE(item->next, next);
+ schedule_work(&free_huge_page_work);
+ return;
+ }
+
+ /*
+ * Racing may prevent some of the deferred huge pages in hpage_freelist
+ * from being freed. Check here and call schedule_work() if that
+ * is the case.
+ */
+ if (hpage_freelist && !work_pending(&free_huge_page_work))
+ schedule_work(&free_huge_page_work);
+
+free_page_now:
+ __free_huge_page(page);
+}
+
static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
{
INIT_LIST_HEAD(&page->lru);
-----------------------------------------------------------------------------------------------------
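To make the lockless push/drain pattern in the patch concrete, here is a
self-contained userspace C model of the same idea. All names here (node,
push, drain, page_id) are mine, and C11 atomic_exchange() stands in for the
kernel's xchg(); it is a sketch of the technique, not the kernel code itself:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
	int page_id;		/* stands in for struct page * */
	struct node *next;
};

/* Sentinel: the producer has not yet published this node's link. */
#define NEXT_PENDING ((struct node *)-1)

static _Atomic(struct node *) freelist;

/* Producer side (the softIRQ path): lockless push onto the list. */
static void push(int page_id)
{
	struct node *item = malloc(sizeof(*item));
	struct node *old;

	item->page_id = page_id;
	item->next = NEXT_PENDING;		/* link not published yet */
	old = atomic_exchange(&freelist, item);	/* item is now visible  */
	item->next = old;			/* publish the link     */
}

/* Consumer side (the workqueue item): grab the whole list and free it. */
static int drain(void)
{
	struct node *curr = atomic_exchange(&freelist, NULL);
	int freed = 0;

	while (curr) {
		struct node *next;

		/* Spin until the producer has published the link. */
		while ((next = curr->next) == NEXT_PENDING)
			;
		free(curr);
		freed++;
		curr = next;
	}
	return freed;
}
```

This is a multiple-producer/single-consumer stack: any number of softIRQ
completions can push concurrently, while the single work item drains. The
NEXT_PENDING marker covers the window between the xchg() that makes the new
node visible and the store that links it to the old head.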
I think it may be simpler and less risky to use spin_lock_bh() as you
have suggested. Also, the above code will not be sufficient if more
lock-taking functions end up being called from softIRQ context in the future.
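For reference, the spin_lock_bh() variant would look roughly like the
following untested sketch. Every taker of hugetlb_lock would need the same
conversion; this shows only the free_huge_page() path:

```
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ void free_huge_page(struct page *page)
-	spin_lock(&hugetlb_lock);
+	spin_lock_bh(&hugetlb_lock);
 	...
-	spin_unlock(&hugetlb_lock);
+	spin_unlock_bh(&hugetlb_lock);
```

Because spin_lock_bh() disables softIRQ processing on the local CPU, a BIO
completion running free_huge_page() can no longer interrupt a holder of
hugetlb_lock on that CPU and self-deadlock.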
So what do you think?
Cheers,
Longman
Thread overview: 5 messages
2019-12-09 16:01 Waiman Long
2019-12-09 16:05 ` Waiman Long
2019-12-09 16:49 ` Matthew Wilcox
2019-12-10 0:46 ` Waiman Long
2019-12-10 2:37 ` Waiman Long [this message]