Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Hugh Dickins
Cc: "Kirill A. Shutemov", kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com> <20191125093611.hlamtyo4hvefwibi@box> <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com> <20191125183350.5gmcln6t3ofszbsy@box> <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com> <14b7c24b-706e-79cf-6fbc-f3c042f30f06@linux.alibaba.com>
Date: Mon, 2 Dec 2019 15:07:25 -0800

On 11/27/19 7:06 PM, Hugh Dickins wrote:
> On Tue, 26 Nov 2019, Yang Shi wrote:
>> On 11/25/19 11:33 AM, Yang Shi wrote:
>>> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>>>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>>>> Currently, when truncating a shmem file, if the range covers only part
>>>>>>> of a THP (start or end is in the middle of the THP), the pages just get
>>>>>>> cleared rather than freed, unless the range covers the whole THP. Even
>>>>>>> when all the subpages have been truncated (randomly or sequentially),
>>>>>>> the THP may still be kept in the page cache. This might be fine for
>>>>>>> some use cases which prefer preserving THP.
>>>>>>>
>>>>>>> But when doing balloon inflation, QEMU actually does hole punch or
>>>>>>> MADV_DONTNEED in base-page-size granularity if hugetlbfs is not used.
>>>>>>> So, when using shmem THP as the memory backend, QEMU inflation doesn't
>>>>>>> work as expected since it doesn't free memory. The inflation use case
>>>>>>> really needs the memory freed. Anonymous THP is not freed right away
>>>>>>> either, but it is freed eventually once all subpages are unmapped,
>>>>>>> whereas shmem THP would stay in the page cache.
>>>>>>>
>>>>>>> To protect the use cases which may prefer preserving THP, introduce a
>>>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>>>>> the preferred behavior when truncating a partial THP. This mode just
>>>>>>> makes sense for tmpfs for the time being.
> Sorry, I haven't managed to set aside enough time for this until now.
>
> First off, let me say that I firmly believe this punch-split behavior
> should be the standard behavior (like in my huge tmpfs implementation),
> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
> But I don't know if I'll be able to persuade Kirill of that.
>
> If the caller wants to write zeroes into the file, she can do so with the
> write syscall: the caller has asked to punch a hole or truncate the file,
> and in our case, like your QEMU case, hopes that memory and memcg charge
> will be freed by doing so. I'll be surprised if changing the behavior
> to yours and mine turns out to introduce a regression, but if it does,
> I guess we'll then have to put it behind a sysctl or whatever.

I'm not sure whether there may be a regression or not, so I added the flag
to offer a choice. But a sysctl may be better. Anyway, I agree we don't
have to worry about a potential regression until a real one is found.

> IIUC the reason that it's currently implemented by clearing the hole
> is because split_huge_page() (unlike in older refcounting days) cannot
> be guaranteed to succeed. Which is unfortunate, and none of us is very
> keen to build a filesystem on unreliable behavior; but the failure cases
> appear in practice to be rare enough, that it's on balance better to give
> the punch-hole-truncate caller what she asked for whenever possible.
>
>>>>>> We need to clarify interaction with khugepaged. This implementation
>>>>>> doesn't do anything to prevent khugepaged from collapsing the range
>>>>>> back to THP just after the split.
>>>>> Yes, it doesn't. Will clarify this in the commit log.
>>>> Okay, but I'm not sure that documentation alone will be enough. We need
>>>> a proper design.
>>> Maybe we could hold the inode lock for read during collapse_file(). The
>>> shmem fallocate path does acquire the inode lock for write, so this should
>>> be able to synchronize hole punch and khugepaged. And shmem only needs to
>>> hold the inode lock for llseek and fallocate; I suppose they should not be
>>> called frequently enough to have an impact on khugepaged. llseek might be
>>> frequent, but it should be quite fast. However, they might get blocked by
>>> khugepaged.
>>>
>>> It sounds safe to hold a rwsem during collapsing THP.
> No, I don't think we want to take any more locks while collapsing THP,
> but that wasn't really the point. We're not concerned about a *race*
> between splitting and khugepaged reassembling (I'm assuming that any
> such race would already exist, and be well-guarded against by all the
> refcount checks, punchhole not adding anything new here; but perhaps
> I'm assuming too blithely, and it is worth checking over).
>
> The point, as I see it anyway, is the contradiction in effort: the
> caller asks for a hole to be punched, we do that, then a few seconds
> or minutes later, khugepaged comes along and fills in the hole (if
> huge page availability and the memcg limit allow it).
>
> I agree that's not very satisfactory, but I think it's a side issue:
> we don't have a good mechanism to tell khugepaged to keep off a range.
> As it is, fallocate and ftruncate ought to do the job expected of them,
> and khugepaged ought to do the job expected of it. And in many cases,
> the punched file will not even be mapped (visible to khugepaged), or
> max_ptes_none set to exclude working on such holes.
>
> Is khugepaged's action an issue for your QEMU case?

So far not.

>
>>> Or we could set VM_NOHUGEPAGE in the shmem inode's flags on hole punch and
>>> clear it after truncate, then check the flag before doing collapse in
>>> khugepaged. khugepaged should not need to hold the inode lock during
>>> collapse since it could be released after the flag is checked.
>> On relooking at the code, the latter option (checking VM_NOHUGEPAGE)
>> doesn't make sense: it can't prevent khugepaged from collapsing THP in
>> parallel.
>>
>>>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>>>>  			}
>>>>>>>  			unlock_page(page);
>>>>>>>  		}
>>>>>>> +rescan_split:
>>>>>>>  		pagevec_remove_exceptionals(&pvec);
>>>>>>>  		pagevec_release(&pvec);
>>>>>>> +
>>>>>>> +		if (split && PageTransCompound(page)) {
>>>>>>> +			/* The THP may get freed under us */
>>>>>>> +			if (!get_page_unless_zero(compound_head(page)))
>>>>>>> +				goto rescan_out;
>>>>>>> +
>>>>>>> +			lock_page(page);
>>>>>>> +
>>>>>>> +			/*
>>>>>>> +			 * The extra pins from page cache lookup have been
>>>>>>> +			 * released by pagevec_release().
>>>>>>> +			 */
>>>>>>> +			if (!split_huge_page(page)) {
>>>>>>> +				unlock_page(page);
>>>>>>> +				put_page(page);
>>>>>>> +				/* Re-look up page cache from current index */
>>>>>>> +				goto again;
>>>>>>> +			}
>>>>>>> +			unlock_page(page);
>>>>>>> +			put_page(page);
>>>>>>> +		}
>>>>>>> +rescan_out:
>>>>>>>  		index++;
>>>>>>>  	}
>>>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>>>> page looks very suboptimal.
>>>>> If I don't drop the pins the THP can't be split. And, there might be
>>>>> more than one pin from find_get_entries() if I read the code correctly.
>>>>> For example, when truncating an 8K length in the middle of a THP, the
>>>>> THP's refcount would get bumped twice since two subpages would be
>>>>> returned.
>>>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>>>
>>>> The current code is buggy. You need to check that the page still belongs
>>>> to the file after the speculative lookup.
> Yes indeed (I think you can even keep the page locked, but I may be wrong).
>
>>> Yes, I missed this point. Thanks for the suggestion.
> The main problem I see is your "goto retry" and "goto again":
> split_huge_page() may fail because a get_page() somewhere is holding
> a transient reference to the page, or it may fail because there's a
> GUP that holds a reference for days: you do not want to be stuck here
> going round and around the loop waiting for that GUP to be released!

I think my code already handles this case. Once the split fails, it just
falls through to the next index. It only retries as long as the split
succeeds. But my patch does clear before splitting.

> It's nice that we already have a trylock_page() loop followed by a
> lock_page() loop. When split_huge_page() fails, the trylock_page()
> loop can simply move on to the next page (skip over the compound page,
> or retry it subpage by subpage? I've forgotten the pros and cons),
> and leave final resolution to the later lock_page() loop: which has to
> accept when split_huge_page() failed, and fall back to clearing instead.
>
> I would prefer a smaller patch than your RFC: making split the
> default behavior cuts out a lot of it, but I think there's still
> more that can be cut. Here's the patch we've been using internally,
> which deletes quite a lot of the old code; but you'll quickly notice
> it has a "Revisit later" hack in find_get_entries(), which I've not got
> around to revisiting yet. Please blend what you can from my patch
> into yours, or vice versa.

Thank you very much. It looks like your "Revisit later" hack tries to keep
just one extra pin on the THP so that the split can succeed.

I will try to blend the two patches.
>
> Hugh
>
> ---
>
>  mm/filemap.c |  3 +
>  mm/shmem.c   | 86 +++++++++++++++++--------------------------------
>  2 files changed, 34 insertions(+), 55 deletions(-)
>
> --- v5.4/mm/filemap.c	2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/filemap.c	2019-11-27 16:21:16.316801433 -0800
> @@ -1752,6 +1752,9 @@ unsigned find_get_entries(struct address
>  			goto put_page;
>  		page = find_subpage(page, xas.xa_index);
>  
> +		/* Revisit later: make shmem_undo_range() easier for now */
> +		if (PageTransCompound(page))
> +			nr_entries = ret + 1;
>  export:
>  		indices[ret] = xas.xa_index;
>  		entries[ret] = page;
> --- v5.4/mm/shmem.c	2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/shmem.c	2019-11-27 16:21:16.320801450 -0800
> @@ -788,6 +788,20 @@ void shmem_unlock_mapping(struct address
>  	}
>  }
>  
> +static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
> +{
> +	if (!PageTransCompound(page))
> +		return true;
> +
> +	/* Just proceed to delete a huge page wholly within the range punched */
> +	if (PageHead(page) &&
> +	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
> +		return true;
> +
> +	/* Try to split huge page, so we can truly punch the hole or truncate */
> +	return split_huge_page(page) >= 0;
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -838,31 +852,11 @@ static void shmem_undo_range(struct inod
>  			if (!trylock_page(page))
>  				continue;
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> -					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> -					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> -				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
> -			}
> -
> -			if (!unfalloc || !PageUptodate(page)) {
> -				VM_BUG_ON_PAGE(PageTail(page), page);
> -				if (page_mapping(page) == mapping) {
> -					VM_BUG_ON_PAGE(PageWriteback(page), page);
> +			if ((!unfalloc || !PageUptodate(page)) &&
> +			    page_mapping(page) == mapping) {
> +				VM_BUG_ON_PAGE(PageWriteback(page), page);
> +				if (shmem_punch_compound(page, start, end))
>  					truncate_inode_page(mapping, page);
> -				}
>  			}
>  			unlock_page(page);
>  		}
> @@ -936,43 +930,25 @@ static void shmem_undo_range(struct inod
>  
>  			lock_page(page);
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				/*
> -				 * Partial thp truncate due 'start' in middle
> -				 * of THP: don't need to look on these pages
> -				 * again on !pvec.nr restart.
> -				 */
> -				if (index != round_down(end, HPAGE_PMD_NR))
> -					start++;
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> -					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> -					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> -				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
> -			}
> -
>  			if (!unfalloc || !PageUptodate(page)) {
> -				VM_BUG_ON_PAGE(PageTail(page), page);
> -				if (page_mapping(page) == mapping) {
> -					VM_BUG_ON_PAGE(PageWriteback(page), page);
> -					truncate_inode_page(mapping, page);
> -				} else {
> +				if (page_mapping(page) != mapping) {
>  					/* Page was replaced by swap: retry */
>  					unlock_page(page);
>  					index--;
>  					break;
>  				}
> +				VM_BUG_ON_PAGE(PageWriteback(page), page);
> +				if (shmem_punch_compound(page, start, end))
> +					truncate_inode_page(mapping, page);
> +				else {
> +					/* Wipe the page and don't get stuck */
> +					clear_highpage(page);
> +					flush_dcache_page(page);
> +					set_page_dirty(page);
> +					if (index <
> +					    round_up(start, HPAGE_PMD_NR))
> +						start = index + 1;
> +				}
>  			}
>  			unlock_page(page);
>  		}