Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi
To: "Kirill A. Shutemov", Hugh Dickins
Cc: kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Mon, 2 Dec 2019 15:14:50 -0800
In-Reply-To: <20191128113456.5phjhd3ajgky3h3i@box>

On 11/28/19 3:34 AM, Kirill A. Shutemov wrote:
> On Wed, Nov 27, 2019 at 07:06:01PM -0800, Hugh Dickins wrote:
>> On Tue, 26 Nov 2019, Yang Shi wrote:
>>> On 11/25/19 11:33 AM, Yang Shi wrote:
>>>> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>>>>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>>>> On 11/25/19 1:36 AM, Kirill A.
Shutemov wrote:
>>>>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>>>>> Currently when truncating shmem file, if the range is partial of THP
>>>>>>>> (start or end is in the middle of THP), the pages actually will just get
>>>>>>>> cleared rather than being freed unless the range cover the whole THP.
>>>>>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>>>>>> the THP may still be kept in page cache.  This might be fine for some
>>>>>>>> usecases which prefer preserving THP.
>>>>>>>>
>>>>>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>>>>>> or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>>>>>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>>>>>> work as expected since it doesn't free memory.  But, the inflation
>>>>>>>> usecase really needs get the memory freed.  Anonymous THP will not get
>>>>>>>> freed right away too but it will be freed eventually when all subpages are
>>>>>>>> unmapped, but shmem THP would still stay in page cache.
>>>>>>>>
>>>>>>>> To protect the usecases which may prefer preserving THP, introduce a
>>>>>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>>>>>> preferred behavior if truncating partial THP.  This mode just makes
>>>>>>>> sense to tmpfs for the time being.
>> Sorry, I haven't managed to set aside enough time for this until now.
>>
>> First off, let me say that I firmly believe this punch-split behavior
>> should be the standard behavior (like in my huge tmpfs implementation),
>> and we should not need a special FALLOC_FL_SPLIT_HPAGE to do it.
>> But I don't know if I'll be able to persuade Kirill of that.
>>
>> If the caller wants to write zeroes into the file, she can do so with the
>> write syscall: the caller has asked to punch a hole or truncate the file,
>> and in our case, like your QEMU case, hopes that memory and memcg charge
>> will be freed by doing so. I'll be surprised if changing the behavior
>> to yours and mine turns out to introduce a regression, but if it does,
>> I guess we'll then have to put it behind a sysctl or whatever.
>>
>> IIUC the reason that it's currently implemented by clearing the hole
>> is because split_huge_page() (unlike in older refcounting days) cannot
>> be guaranteed to succeed. Which is unfortunate, and none of us is very
>> keen to build a filesystem on unreliable behavior; but the failure cases
>> appear in practice to be rare enough, that it's on balance better to give
>> the punch-hole-truncate caller what she asked for whenever possible.
> I don't have a firm position here. Maybe you are right and we should try
> to split pages right away.
>
> It might be useful to consider case wider than shmem.
>
> On traditional filesystem with a backing storage semantics of the same
> punch hole operation is somewhat different. It doesn't have explicit
> implications on memory footprint. It's about managing persistent storage.
> With shmem/tmpfs it is lumped together.
>
> It might be nice to write down pages that can be discarded under memory
> pressure and leave the huge page intact until then...

Sounds like another deferred split queue. It could be an option, but our
use case needs the memory to be freed right away, since it might be
reused by others very soon.

>
> [ I don't see a problem with your patch as long as we agree that it's
> desired semantics for the interface. ]
>
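
For anyone following along, here is a rough userspace sketch of the
operation under discussion: punching a single base page out of a
shmem-backed file with the existing FALLOC_FL_PUNCH_HOLE and
FALLOC_FL_KEEP_SIZE flags. It deliberately does not use
FALLOC_FL_SPLIT_HPAGE, which exists only in this RFC. The helper names
(punch_hole, punch_hole_demo), the use of memfd_create() as a stand-in
for a tmpfs file, and the 2 MiB region size (one PMD-sized THP on
x86-64) are assumptions made for illustration. The sketch only checks
the visible semantics (the hole reads back as zeroes and the file size
is unchanged); whether the memory behind the hole is actually freed is
exactly the question in this thread and cannot be observed this way.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: one PMD-sized THP on x86-64 is 2 MiB. */
#define REGION_SIZE (2UL * 1024 * 1024)

/* Punch a hole over [offset, offset + len) without changing the file size. */
int punch_hole(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}

/*
 * Hypothetical demo: fill a shmem-backed file, punch one base page out of
 * it, and verify that the hole reads back as zeroes while the neighbouring
 * bytes and the file size are untouched.
 * Returns 0 on success, -1 on any failure.
 */
int punch_hole_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    struct stat st;
    off_t hole;
    char *map;
    int ret = -1;
    int fd;

    if (page <= 0)
        return -1;
    hole = 3 * (off_t)page;             /* arbitrary page-aligned offset */

    fd = memfd_create("thp-punch-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, REGION_SIZE) != 0)
        goto out_close;

    map = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;
    memset(map, 0xaa, REGION_SIZE);     /* touch every page in the region */

    if (punch_hole(fd, hole, page) != 0)
        goto out_unmap;

    /* The file size must be unchanged because of FALLOC_FL_KEEP_SIZE. */
    if (fstat(fd, &st) != 0 || st.st_size != (off_t)REGION_SIZE)
        goto out_unmap;
    /* First and last byte of the hole must now read as zero. */
    if (map[hole] != 0 || map[hole + page - 1] != 0)
        goto out_unmap;
    /* Bytes just outside the hole must still hold the fill pattern. */
    if ((unsigned char)map[hole - 1] != 0xaa ||
        (unsigned char)map[hole + page] != 0xaa)
        goto out_unmap;
    ret = 0;

out_unmap:
    munmap(map, REGION_SIZE);
out_close:
    close(fd);
    return ret;
}
```

Calling punch_hole_demo() from a small main() and checking for a zero
return is enough to try it. On a kernel where the range is backed by a
shmem THP, the same call succeeds either way; the difference between
clearing and splitting shows up only in memory accounting.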