Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com,
 akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org
Date: Mon, 25 Nov 2019 11:33:41 -0800
Message-ID: <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>
In-Reply-To: <20191125183350.5gmcln6t3ofszbsy@box>
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com>
 <20191125093611.hlamtyo4hvefwibi@box>
 <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>
 <20191125183350.5gmcln6t3ofszbsy@box>

On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>
>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>> Currently, when truncating a shmem file, if the range covers only part
>>>> of a THP (start or end falls in the middle of the THP), the pages just
>>>> get cleared rather than freed unless the range covers the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in the page cache.  This might be fine for
>>>> some usecases which prefer preserving THP.
>>>>
>>>> But when doing balloon inflation, QEMU does the hole punch or
>>>> MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>>>> So, when using shmem THP as the memory backend, QEMU inflation doesn't
>>>> work as expected since it doesn't free memory.  But the inflation
>>>> usecase really needs the memory to be freed.  Anonymous THP will not
>>>> get freed right away either, but it will be freed eventually once all
>>>> subpages are unmapped; shmem THP, however, would still stay in the
>>>> page cache.
>>>>
>>>> To protect the usecases which may prefer preserving THP, introduce a
>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting the
>>>> THP is the preferred behavior when truncating a partial THP.  This
>>>> mode only makes sense for tmpfs for the time being.
>>> We need to clarify the interaction with khugepaged. This implementation
>>> doesn't do anything to prevent khugepaged from collapsing the range back
>>> to a THP just after the split.
>> Yes, it doesn't. Will clarify this in the commit log.
> Okay, but I'm not sure that documentation alone will be enough. We need a
> proper design.

Maybe we could try to hold the inode lock for read during collapse_file().
The shmem fallocate path acquires the inode lock for write, so this should
be able to synchronize hole punch against khugepaged. And shmem only needs
to hold the inode lock for llseek and fallocate; I suppose those are not
called frequently enough to have an impact on khugepaged. llseek might be
called often, but it should be quite fast. However, they might get blocked
by khugepaged. It sounds safe to hold a rwsem while collapsing the THP.

Or we could set VM_NOHUGEPAGE in the shmem inode's flags when punching the
hole and clear it after the truncate, then check the flag before doing the
collapse in khugepaged. khugepaged should not need to hold the inode lock
during the collapse since the lock could be released once the flag has been
checked.
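Roughly something like the below, just as an untested sketch; the
SHMEM_NO_COLLAPSE name and the exact hook points are made up here for
illustration and are not part of this patch:

/* Hypothetical bit in shmem_inode_info->flags; the name is made up. */
#define SHMEM_NO_COLLAPSE       0x80000000

/* shmem_fallocate(), hole punch case, with the inode lock held for write */
        struct shmem_inode_info *info = SHMEM_I(inode);

        info->flags |= SHMEM_NO_COLLAPSE;   /* locking of info->flags omitted */
        shmem_truncate_range(inode, offset, offset + len - 1);
        info->flags &= ~SHMEM_NO_COLLAPSE;

/*
 * khugepaged's collapse_file(), before touching the page cache;
 * "mapping" is file->f_mapping of the file being collapsed.
 */
        if (shmem_mapping(mapping) &&
            (SHMEM_I(mapping->host)->flags & SHMEM_NO_COLLAPSE))
                return;     /* a hole punch is in flight, skip this file */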
>
>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>  			}
>>>>  			unlock_page(page);
>>>>  		}
>>>> +rescan_split:
>>>>  		pagevec_remove_exceptionals(&pvec);
>>>>  		pagevec_release(&pvec);
>>>> +
>>>> +		if (split && PageTransCompound(page)) {
>>>> +			/* The THP may get freed under us */
>>>> +			if (!get_page_unless_zero(compound_head(page)))
>>>> +				goto rescan_out;
>>>> +
>>>> +			lock_page(page);
>>>> +
>>>> +			/*
>>>> +			 * The extra pins from page cache lookup have been
>>>> +			 * released by pagevec_release().
>>>> +			 */
>>>> +			if (!split_huge_page(page)) {
>>>> +				unlock_page(page);
>>>> +				put_page(page);
>>>> +				/* Re-look up page cache from current index */
>>>> +				goto again;
>>>> +			}
>>>> +			unlock_page(page);
>>>> +			put_page(page);
>>>> +		}
>>>> +rescan_out:
>>>>  		index++;
>>>>  	}
>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>> page looks very suboptimal.
>> If I don't drop the pins the THP can't be split. And, there might be more
>> than one pin from find_get_entries() if I read the code correctly. For
>> example, when truncating an 8K range in the middle of a THP, the THP's
>> refcount would get bumped twice since two subpages would be returned.
> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>
> Current code is buggy. You need to check that the page still belongs to
> the file after the speculative lookup.

Yes, I missed this point. Thanks for the suggestion.
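Something like this, I guess (untested sketch, just to make sure I
understand the suggestion; "page", "split", "mapping", "index" and the
"again" label are the ones from the patch context above):

struct page *hpage = NULL;

if (split && PageTransCompound(page)) {
        /* our own pin, taken while the pagevec still holds its references */
        hpage = compound_head(page);
        get_page(hpage);
}

pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);

if (hpage) {
        lock_page(hpage);

        /*
         * Recheck after the speculative lookup: the page may have been
         * truncated or reclaimed and may no longer belong to this file.
         */
        if (hpage->mapping == mapping && !split_huge_page(hpage)) {
                unlock_page(hpage);
                put_page(hpage);
                /* re-walk the page cache from the current index */
                goto again;
        }
        unlock_page(hpage);
        put_page(hpage);
}
index++;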