From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 16 Dec 2019 13:51:10 -0800
From: Andrew Morton
To: Waiman Long
Cc: Mike Kravetz, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Matthew Wilcox, Davidlohr Bueso, Andi Kleen, Michal Hocko
Subject: Re: [PATCH] mm/hugetlb: Defer freeing of huge pages if in non-task context
Message-Id: <20191216135110.b6fb283ba5551c8cfb22494e@linux-foundation.org>
In-Reply-To: <20191216182739.26880-1-longman@redhat.com>
References: <20191216182739.26880-1-longman@redhat.com>

On Mon, 16 Dec 2019 13:27:39 -0500 Waiman Long wrote:

> The following lockdep splat was observed when a certain hugetlbfs test
> was run:
>
> ...
>
> Both the hugetlb_lock and the subpool lock can be acquired in
> free_huge_page(). One way to solve the problem is to make both locks
> irq-safe. Another alternative is to defer the freeing to a workqueue job.
>
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.
>
> The generic workqueue is used to process the work, but a dedicated
> workqueue can be used instead if it is desirable to have the huge page
> freed ASAP.
>
> ...
>
> @@ -1199,6 +1199,73 @@ void free_huge_page(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  }
>  
> +/*
> + * As free_huge_page() can be called from a non-task context, we have
> + * to defer the actual freeing in a workqueue to prevent potential
> + * hugetlb_lock deadlock.
> + *
> + * free_hpage_workfn() locklessly retrieves the linked list of pages to
> + * be freed and frees them one-by-one. As the page->mapping pointer is
> + * going to be cleared in __free_huge_page() anyway, it is reused as the
> + * next pointer of a singly linked list of huge pages to be freed.
> + */
> +#define NEXT_PENDING	((struct page *)-1)
> +static struct page *hpage_freelist;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> +	struct page *curr, *next;
> +	int cnt = 0;
> +
> +	do {
> +		curr = xchg(&hpage_freelist, NULL);
> +		if (!curr)
> +			break;
> +
> +		while (curr) {
> +			next = (struct page *)READ_ONCE(curr->mapping);
> +			if (next == NEXT_PENDING) {
> +				cpu_relax();
> +				continue;
> +			}
> +			__free_huge_page(curr);
> +			curr = next;
> +			cnt++;
> +		}
> +	} while (!READ_ONCE(hpage_freelist));
> +
> +	if (!cnt)
> +		return;
> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> +}
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> +	/*
> +	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> +	 */
> +	if (!in_task()) {
> +		struct page *next;
> +
> +		page->mapping = (struct address_space *)NEXT_PENDING;
> +		next = xchg(&hpage_freelist, page);
> +		WRITE_ONCE(page->mapping, (struct address_space *)next);

The NEXT_PENDING stuff could do with some commenting, I think.  It's
reasonably obvious, but not obvious enough.  For example, why does the
second write to page->mapping use WRITE_ONCE() but the first does not?
Please spell out the design, fully.

> +		schedule_work(&free_hpage_work);
> +		return;
> +	}
> +
> +	/*
> +	 * Racing may prevent some deferred huge pages in hpage_freelist
> +	 * from being freed. Check here and call schedule_work() if that
> +	 * is the case.
> +	 */
> +	if (unlikely(hpage_freelist && !work_pending(&free_hpage_work)))
> +		schedule_work(&free_hpage_work);
> +
> +	__free_huge_page(page);
> +}
> +
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
>  	INIT_LIST_HEAD(&page->lru);

Otherwise it looks OK to me.  Deferring freeing in this way is
generally lame and gives rise to concerns about memory exhaustion in
strange situations, and to concerns about various memory accounting
stats being logically wrong for short periods.  But we already do this
in (too) many places, so fingers crossed :(
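
To spell out the design being asked about above, here is a minimal
user-space sketch of the same handshake, written with C11 atomics
rather than the kernel's xchg()/READ_ONCE()/WRITE_ONCE() on
page->mapping; the names node, freelist, defer_free() and drain() are
hypothetical stand-ins for struct page, hpage_freelist, free_huge_page()
and free_hpage_workfn().  The point it illustrates: the first store of
NEXT_PENDING happens before the node is published by the exchange, so
no other CPU can see the node yet and a plain store is enough, while
the second store happens after publication, when the work function may
already be walking the list and spinning on the sentinel, so it has to
be a single ordered store that the consumer re-reads until it changes.

/* Minimal user-space sketch of the NEXT_PENDING handshake (C11 atomics). */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {                               /* stand-in for struct page    */
        _Atomic(struct node *) next;        /* plays the role of ->mapping */
        int id;
};

#define NEXT_PENDING ((struct node *)-1)    /* "next pointer not yet published" */

static _Atomic(struct node *) freelist;     /* plays the role of hpage_freelist */

/* Producer side: mirrors the !in_task() path of free_huge_page(). */
static void defer_free(struct node *n)
{
        /*
         * n is not yet reachable from freelist, so nothing can race with
         * this store -- which is why the patch can use a plain assignment
         * (no WRITE_ONCE()) for the first write to page->mapping.
         */
        atomic_store_explicit(&n->next, NEXT_PENDING, memory_order_relaxed);

        /* Publish n as the new list head and obtain the previous head. */
        struct node *old = atomic_exchange(&freelist, n);

        /*
         * n is visible now; a concurrent consumer may already be spinning
         * on the sentinel.  Replace it with the real successor in one
         * ordered store (the WRITE_ONCE() in the patch).
         */
        atomic_store_explicit(&n->next, old, memory_order_release);
}

/* Consumer side: mirrors free_hpage_workfn(). */
static void drain(void)
{
        struct node *curr = atomic_exchange(&freelist, NULL);

        while (curr) {
                struct node *next =
                        atomic_load_explicit(&curr->next, memory_order_acquire);

                if (next == NEXT_PENDING)   /* producer between xchg and store */
                        continue;           /* spin; the patch calls cpu_relax() */

                printf("freeing node %d\n", curr->id);
                free(curr);
                curr = next;
        }
}

int main(void)
{
        for (int i = 0; i < 3; i++) {
                struct node *n = malloc(sizeof(*n));
                n->id = i;
                defer_free(n);
        }
        drain();
        return 0;
}

In both the sketch and the patch, the exchange on the list head is what
keeps the consumer from seeing a node before its sentinel is in place,
so the only two values it can ever read from ->next are NEXT_PENDING or
the real successor.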