Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Hugh Dickins
Cc: "Kirill A. Shutemov", kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com> <20191125093611.hlamtyo4hvefwibi@box> <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com> <20191125183350.5gmcln6t3ofszbsy@box> <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com> <14b7c24b-706e-79cf-6fbc-f3c042f30f06@linux.alibaba.com>
Date: Mon, 2 Dec 2019 15:07:25 -0800

On 11/27/19 7:06 PM, Hugh Dickins wrote:
> On Tue, 26 Nov 2019, Yang Shi wrote:
>> On 11/25/19 11:33 AM, Yang Shi wrote:
>>> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>>>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>>>> Currently, when truncating a shmem file, if the range covers only part
>>>>>>> of a THP (start or end is in the middle of the THP), the pages just get
>>>>>>> cleared rather than freed, unless the range covers the whole THP. Even
>>>>>>> when all the subpages have been truncated (randomly or sequentially),
>>>>>>> the THP may still be kept in the page cache. This might be fine for
>>>>>>> some use cases which prefer preserving THP.
>>>>>>>
>>>>>>> But when doing balloon inflation, QEMU actually does hole punch or
>>>>>>> MADV_DONTNEED in base-page-size granularity if hugetlbfs is not used.
>>>>>>> So, when using shmem THP as the memory backend, QEMU inflation doesn't
>>>>>>> work as expected since it doesn't free memory. The inflation use case
>>>>>>> really needs the memory freed. Anonymous THP is not freed right away
>>>>>>> either, but it is freed eventually once all subpages are unmapped,
>>>>>>> whereas shmem THP would stay in the page cache.
>>>>>>>
>>>>>>> To protect the use cases which may prefer preserving THP, introduce a
>>>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>>>>> the preferred behavior when truncating a partial THP. This mode just
>>>>>>> makes sense for tmpfs for the time being.
> Sorry, I haven't managed to set aside enough time for this until now.
>
> First off, let me say that I firmly believe this punch-split behavior
> should be the standard behavior (like in my huge tmpfs implementation),
> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
> But I don't know if I'll be able to persuade Kirill of that.
>
> If the caller wants to write zeroes into the file, she can do so with the
> write syscall: the caller has asked to punch a hole or truncate the file,
> and in our case, like your QEMU case, hopes that memory and memcg charge
> will be freed by doing so. I'll be surprised if changing the behavior
> to yours and mine turns out to introduce a regression, but if it does,
> I guess we'll then have to put it behind a sysctl or whatever.

I'm not sure whether there may be a regression or not, so I added the flag
to offer a choice. But a sysctl may be better. Anyway, I agree we don't
have to worry about a potential regression until a real one is found.

> IIUC the reason that it's currently implemented by clearing the hole
> is because split_huge_page() (unlike in older refcounting days) cannot
> be guaranteed to succeed. Which is unfortunate, and none of us is very
> keen to build a filesystem on unreliable behavior; but the failure cases
> appear in practice to be rare enough, that it's on balance better to give
> the punch-hole-truncate caller what she asked for whenever possible.
>
>>>>>> We need to clarify interaction with khugepaged. This implementation
>>>>>> doesn't do anything to prevent khugepaged from collapsing the range
>>>>>> back to THP just after the split.
>>>>> Yes, it doesn't. Will clarify this in the commit log.
>>>> Okay, but I'm not sure that documentation alone will be enough. We need
>>>> a proper design.
>>> Maybe we could hold the inode lock for read during collapse_file(). The
>>> shmem fallocate path does acquire the inode lock for write, so this should
>>> be able to synchronize hole punch and khugepaged. And shmem only needs to
>>> hold the inode lock for llseek and fallocate; I suppose they should not be
>>> called frequently enough to have an impact on khugepaged. llseek might be
>>> frequent, but it should be quite fast. However, they might get blocked by
>>> khugepaged.
>>>
>>> It sounds safe to hold a rwsem during collapsing THP.
> No, I don't think we want to take any more locks while collapsing THP,
> but that wasn't really the point. We're not concerned about a *race*
> between splitting and khugepaged reassembling (I'm assuming that any
> such race would already exist, and be well-guarded against by all the
> refcount checks, punchhole not adding anything new here; but perhaps
> I'm assuming too blithely, and it is worth checking over).
>
> The point, as I see it anyway, is the contradiction in effort: the
> caller asks for a hole to be punched, we do that, then a few seconds
> or minutes later, khugepaged comes along and fills in the hole (if
> huge page availability and the memcg limit allow it).
>
> I agree that's not very satisfactory, but I think it's a side issue:
> we don't have a good mechanism to tell khugepaged to keep off a range.
> As it is, fallocate and ftruncate ought to do the job expected of them,
> and khugepaged ought to do the job expected of it. And in many cases,
> the punched file will not even be mapped (visible to khugepaged), or
> max_ptes_none set to exclude working on such holes.
>
> Is khugepaged's action an issue for your QEMU case?

So far not.

>
>>> Or we could set VM_NOHUGEPAGE in the shmem inode's flags on hole punch and
>>> clear it after truncate, then check the flag before doing collapse in
>>> khugepaged. khugepaged should not need to hold the inode lock during
>>> collapse since it could be released after the flag is checked.
>> On relooking at the code, the latter option (checking VM_NOHUGEPAGE)
>> doesn't make sense: it can't prevent khugepaged from collapsing THP in
>> parallel.
>>
>>>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>>>>  			}
>>>>>>>  			unlock_page(page);
>>>>>>>  		}
>>>>>>> +rescan_split:
>>>>>>>  		pagevec_remove_exceptionals(&pvec);
>>>>>>>  		pagevec_release(&pvec);
>>>>>>> +
>>>>>>> +		if (split && PageTransCompound(page)) {
>>>>>>> +			/* The THP may get freed under us */
>>>>>>> +			if (!get_page_unless_zero(compound_head(page)))
>>>>>>> +				goto rescan_out;
>>>>>>> +
>>>>>>> +			lock_page(page);
>>>>>>> +
>>>>>>> +			/*
>>>>>>> +			 * The extra pins from page cache lookup have been
>>>>>>> +			 * released by pagevec_release().
>>>>>>> +			 */
>>>>>>> +			if (!split_huge_page(page)) {
>>>>>>> +				unlock_page(page);
>>>>>>> +				put_page(page);
>>>>>>> +				/* Re-look up page cache from current index */
>>>>>>> +				goto again;
>>>>>>> +			}
>>>>>>> +			unlock_page(page);
>>>>>>> +			put_page(page);
>>>>>>> +		}
>>>>>>> +rescan_out:
>>>>>>>  		index++;
>>>>>>>  	}
>>>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>>>> page looks very suboptimal.
>>>>> If I don't drop the pins the THP can't be split. And, there might be
>>>>> more than one pin from find_get_entries() if I read the code correctly.
>>>>> For example, when truncating an 8K length in the middle of a THP, the
>>>>> THP's refcount would get bumped twice since two subpages would be
>>>>> returned.
>>>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>>>
>>>> The current code is buggy. You need to check that the page still belongs
>>>> to the file after the speculative lookup.
> Yes indeed (I think you can even keep the page locked, but I may be wrong).
>
>>> Yes, I missed this point. Thanks for the suggestion.
> The main problem I see is your "goto retry" and "goto again":
> split_huge_page() may fail because a get_page() somewhere is holding
> a transient reference to the page, or it may fail because there's a
> GUP that holds a reference for days: you do not want to be stuck here
> going round and around the loop waiting for that GUP to be released!

I think my code already handles this case. Once the split fails, it just
falls through to the next index. It only retries as long as the split
succeeds. But my patch does clear before splitting.

> It's nice that we already have a trylock_page() loop followed by a
> lock_page() loop. When split_huge_page() fails, the trylock_page()
> loop can simply move on to the next page (skip over the compound page,
> or retry it subpage by subpage? I've forgotten the pros and cons),
> and leave final resolution to the later lock_page() loop: which has to
> accept when split_huge_page() failed, and fall back to clearing instead.
>
> I would prefer a smaller patch than your RFC: making split the
> default behavior cuts out a lot of it, but I think there's still
> more that can be cut. Here's the patch we've been using internally,
> which deletes quite a lot of the old code; but you'll quickly notice
> it has a "Revisit later" hack in find_get_entries(), which I've not got
> around to revisiting yet. Please blend what you can from my patch
> into yours, or vice versa.

Thank you very much. It looks like your "Revisit later" hack tries to keep
just one extra pin on the THP so that the split can succeed.

I will try to blend the two patches.
>
> Hugh
>
> ---
>
>  mm/filemap.c |  3 +
>  mm/shmem.c   | 86 +++++++++++++++++--------------------------------
>  2 files changed, 34 insertions(+), 55 deletions(-)
>
> --- v5.4/mm/filemap.c	2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/filemap.c	2019-11-27 16:21:16.316801433 -0800
> @@ -1752,6 +1752,9 @@ unsigned find_get_entries(struct address
>  			goto put_page;
>  		page = find_subpage(page, xas.xa_index);
>  
> +		/* Revisit later: make shmem_undo_range() easier for now */
> +		if (PageTransCompound(page))
> +			nr_entries = ret + 1;
>  export:
>  		indices[ret] = xas.xa_index;
>  		entries[ret] = page;
> --- v5.4/mm/shmem.c	2019-11-24 16:32:01.000000000 -0800
> +++ linux/mm/shmem.c	2019-11-27 16:21:16.320801450 -0800
> @@ -788,6 +788,20 @@ void shmem_unlock_mapping(struct address
>  	}
>  }
>  
> +static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end)
> +{
> +	if (!PageTransCompound(page))
> +		return true;
> +
> +	/* Just proceed to delete a huge page wholly within the range punched */
> +	if (PageHead(page) &&
> +	    page->index >= start && page->index + HPAGE_PMD_NR <= end)
> +		return true;
> +
> +	/* Try to split huge page, so we can truly punch the hole or truncate */
> +	return split_huge_page(page) >= 0;
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -838,31 +852,11 @@ static void shmem_undo_range(struct inod
>  			if (!trylock_page(page))
>  				continue;
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> -					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> -					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> -				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
> -			}
> -
> -			if (!unfalloc || !PageUptodate(page)) {
> -				VM_BUG_ON_PAGE(PageTail(page), page);
> -				if (page_mapping(page) == mapping) {
> -					VM_BUG_ON_PAGE(PageWriteback(page), page);
> +			if ((!unfalloc || !PageUptodate(page)) &&
> +			    page_mapping(page) == mapping) {
> +				VM_BUG_ON_PAGE(PageWriteback(page), page);
> +				if (shmem_punch_compound(page, start, end))
>  					truncate_inode_page(mapping, page);
> -				}
>  			}
>  			unlock_page(page);
>  		}
> @@ -936,43 +930,25 @@ static void shmem_undo_range(struct inod
>  
>  			lock_page(page);
>  
> -			if (PageTransTail(page)) {
> -				/* Middle of THP: zero out the page */
> -				clear_highpage(page);
> -				unlock_page(page);
> -				/*
> -				 * Partial thp truncate due 'start' in middle
> -				 * of THP: don't need to look on these pages
> -				 * again on !pvec.nr restart.
> -				 */
> -				if (index != round_down(end, HPAGE_PMD_NR))
> -					start++;
> -				continue;
> -			} else if (PageTransHuge(page)) {
> -				if (index == round_down(end, HPAGE_PMD_NR)) {
> -					/*
> -					 * Range ends in the middle of THP:
> -					 * zero out the page
> -					 */
> -					clear_highpage(page);
> -					unlock_page(page);
> -					continue;
> -				}
> -				index += HPAGE_PMD_NR - 1;
> -				i += HPAGE_PMD_NR - 1;
> -			}
> -
>  			if (!unfalloc || !PageUptodate(page)) {
> -				VM_BUG_ON_PAGE(PageTail(page), page);
> -				if (page_mapping(page) == mapping) {
> -					VM_BUG_ON_PAGE(PageWriteback(page), page);
> -					truncate_inode_page(mapping, page);
> -				} else {
> +				if (page_mapping(page) != mapping) {
>  					/* Page was replaced by swap: retry */
>  					unlock_page(page);
>  					index--;
>  					break;
>  				}
> +				VM_BUG_ON_PAGE(PageWriteback(page), page);
> +				if (shmem_punch_compound(page, start, end))
> +					truncate_inode_page(mapping, page);
> +				else {
> +					/* Wipe the page and don't get stuck */
> +					clear_highpage(page);
> +					flush_dcache_page(page);
> +					set_page_dirty(page);
> +					if (index <
> +					    round_up(start, HPAGE_PMD_NR))
> +						start = index + 1;
> +				}
>  			}
>  			unlock_page(page);
>  		}