From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1958AC4345F for ; Wed, 24 Apr 2024 09:57:13 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 72C806B00B8; Wed, 24 Apr 2024 05:57:13 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6DC696B00BA; Wed, 24 Apr 2024 05:57:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5CBF36B00BB; Wed, 24 Apr 2024 05:57:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 3F7B86B00B8 for ; Wed, 24 Apr 2024 05:57:13 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id DA6B11208F3 for ; Wed, 24 Apr 2024 09:57:12 +0000 (UTC) X-FDA: 82043972304.12.1FB9E3C Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf16.hostedemail.com (Postfix) with ESMTP id F2943180011 for ; Wed, 24 Apr 2024 09:57:10 +0000 (UTC) Authentication-Results: imf16.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf16.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1713952631; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fJTvnlB9ejnAagY8Xby8I0am9yF+fYS94TNZN7lRkMM=; b=1zDd2Z6i+OyjA6wNZwM6JroPERgSsuVXoHt13+CnpuRzOtdakj4EMxKkBFmpwUqSBxp4nO 8BJnzUBuI+e37zpkXONYnHLNcnX4d6feCbhIsQi+BRE+dIylmKUjypK8DEQJsUyI8Pubbz anKMN8N2iAEIwce2O0xLJ6QPrr/8PWU= ARC-Authentication-Results: i=1; imf16.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf16.hostedemail.com: domain of ryan.roberts@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=ryan.roberts@arm.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1713952631; a=rsa-sha256; cv=none; b=Ggfz0o/2pVsOi+JgOR3Y75Ku3+pn9L4NYQmPhMPenDw5t/V0daOhJF7vkVA3kXAUohhC53 hymkbGwpmys7VxJb6ls2L/Cu2oyTdTw28hLxk4orrjJ0NGDXXcx5to46m8oBRFXG7U6tqQ YNoF/K+5eZ46njBzqBxOwymmwc4ll9A= Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 8B02D339; Wed, 24 Apr 2024 02:57:37 -0700 (PDT) Received: from [10.1.25.156] (unknown [10.1.25.156]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 6C4A93F73F; Wed, 24 Apr 2024 02:57:08 -0700 (PDT) Message-ID: <7ae62c3f-918a-4778-badb-8f7ca74328d8@arm.com> Date: Wed, 24 Apr 2024 10:57:07 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH 1/5] mm: memory: extend finish_fault() to support large folio Content-Language: en-GB To: Baolin Wang , akpm@linux-foundation.org, hughd@google.com Cc: willy@infradead.org, david@redhat.com, wangkefeng.wang@huawei.com, 21cnbao@gmail.com, ying.huang@intel.com, shy828301@gmail.com, ziy@nvidia.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <358aefb1858b63164894d7d8504f3dae0b495366.1713755580.git.baolin.wang@linux.alibaba.com> <6aa25e2a-a6b6-4ab7-8300-053ca3c0d748@arm.com> <6c418d70-a75d-4019-a0f5-56a61002d37a@arm.com> <5f949c1f-c56e-4227-af60-05a2a19f4c2e@linux.alibaba.com> From: Ryan Roberts In-Reply-To: <5f949c1f-c56e-4227-af60-05a2a19f4c2e@linux.alibaba.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam04 X-Rspamd-Queue-Id: F2943180011 X-Rspam-User: X-Stat-Signature: 7q97ftkf8543tx9ypmiaahi6cw8x4my7 X-HE-Tag: 1713952630-273900 X-HE-Meta: U2FsdGVkX1/CY+AVnz/Juak2kI6/hwT2wQRM0dtA0O5k2SUTYmDwwGUN5BDf8R70piEE0gyCLUzWixie9Pf/XS5xBjKSbzb74aAVKv4rDdlXnIbt+SscE1oEpR+kHzjJEMxpSAGf3ZOMsE0fEpHHr1qt9F8xwCOHvdfbqF9ObsvFjSiwoyO3OnFTOr3gLCDvypjFiZc/j/bBg0YKrkW2MrjbCk6lJo+zmYbUWVkRFTsodbgwhrXh+gyc7FuDJPKLh73G5mV/7JjiBKlJvmOT2RhxWv1UxMC7UCwpecuY0JaZkiIy3kxd47UuilFblKcOIU0CxObK8gVPb2bQnA9D1ftBlZsXwNuzII/UsfeXm+cm/60hbWPcke081fGvgR4GvCccYkU0znNsdyeLj0wh6Zr0eHTLJdZNMQyki79D3UWnb87d9pZ9U7vAeqzVjWOCQ6WA1iCZ+wSN9nmc7iVtAIsbVj25ykSdwR0I/convnxqwmjnUzolKWdv21iQ+EOh8aCq5X0LQa7K5lZDxT9zK1681QwKfKY89qK8qp3nuNAbntEN3er5x7WXREtHnLK5KQdQkHubhvP1NS7R1KiJos2Dg/TAXpWEDxPse2fhmsJMSW5PKqFASSuVjNOGwcWlCh8xV68Yc0/KGH1sCPLtc/gmsXhy+dO6fv5SJc4PKlpkp6Run1wJS1w2zYENlco+fpxCgfAV+5ncdEAgvtNhfta/dS1XCbonEoXRGV7JrlLIrbCGafGbOLCoRmj3g3q++Uo8u41GPtyhx774/2EHazJGzgOpMGbVF6DjU4Kzbq5aWAokFf7hsiarEBb7QraFy/Oce3mc7LPQc4OLlTrnKnBsawAYME/stavYvbJ7FeD5rMy0Y9qZkp0loTGD3ejnCjTUkRvZFHQZZS2VtVi57n9KjxSU7UpZTdvSRE5vcH3WNb7pRQ+PchoYh8doIY8zFdQLpZWFITPatmaWaCZ FSiyvT73 ezKAXZbKeb1+/IPYhB5xu+FHrHRB40bTtIf7Mn7azDYqnk0jGMf5S0Gi9MdKYvM8kWNdAUE67trJ9ilZexZd/ZsftTplj9gL+rY5PCLzDtYUrlw58GCVPN71pDKhA7Y7PiAw9bCcfjWPxSlF2Owc0tBV5nkdizqsTxeDxaHm1pnkbsYj55CbDNqG6qBEa4QcbzEed1n0a0jO1Rf7Y5jJZebabSaXrEn00quIQFMq52HeTTaVuxJmoa89Tpw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 24/04/2024 10:26, Baolin Wang wrote: > > > On 2024/4/24 16:07, Ryan Roberts wrote: >> On 24/04/2024 04:23, Baolin Wang wrote: >>> >>> >>> On 2024/4/23 19:03, Ryan Roberts wrote: >>>> On 22/04/2024 08:02, Baolin Wang wrote: >>>>> Add large folio mapping establishment support for finish_fault() as a >>>>> preparation, >>>>> to support multi-size THP allocation of anonymous shared pages in the >>>>> following >>>>> patches. >>>>> >>>>> Signed-off-by: Baolin Wang >>>>> --- >>>>>    mm/memory.c | 25 ++++++++++++++++++------- >>>>>    1 file changed, 18 insertions(+), 7 deletions(-) >>>>> >>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>> index b6fa5146b260..094a76730776 100644 >>>>> --- a/mm/memory.c >>>>> +++ b/mm/memory.c >>>>> @@ -4766,7 +4766,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf) >>>>>    { >>>>>        struct vm_area_struct *vma = vmf->vma; >>>>>        struct page *page; >>>>> +    struct folio *folio; >>>>>        vm_fault_t ret; >>>>> +    int nr_pages, i; >>>>> +    unsigned long addr; >>>>>          /* Did we COW the page? */ >>>>>        if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) >>>>> @@ -4797,22 +4800,30 @@ vm_fault_t finish_fault(struct vm_fault *vmf) >>>>>                return VM_FAULT_OOM; >>>>>        } >>>>>    +    folio = page_folio(page); >>>>> +    nr_pages = folio_nr_pages(folio); >>>>> +    addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); >>>> >>>> I'm not sure this is safe. IIUC, finish_fault() is called for any file-backed >>>> mapping. So you could have a situation where part of a (regular) file is mapped >>>> in the process, faults and hits in the pagecache. But the folio returned by the >>>> pagecache is bigger than the portion that the process has mapped. So you now >>>> end >>>> up mapping beyond the VMA limits? In the pagecache case, you also can't assume >>>> that the folio is naturally aligned in virtual address space. >>> >>> Good point. Yes, I think you are right, I need consider the VMA limits, and I >>> should refer to the calculations of the start pte and end pte in >>> do_fault_around(). >> >> You might also need to be careful not to increase reported RSS. I have a vague >> recollection that David once mentioned a problem with fault-around because it >> causes the reported RSS to increase for the process and this could lead to >> different decisions in other places. IIRC Redhat had an advisory somewhere with >> suggested workaround being to disable fault-around. For the anon-shared memory >> case, it shouldn't be a problem because the user has opted into allocating >> bigger blocks, but there may be a need to ensure we don't also start eagerly >> mapping regular files beyond what fault-around is configured for. > > Thanks for reminding. And I also agree with you that this should not be a > problem since user has selected the larger folio, which is not the same as > fault-around. > >>>>>        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, >>>>> -                      vmf->address, &vmf->ptl); >>>>> +                       addr, &vmf->ptl); >>>>>        if (!vmf->pte) >>>>>            return VM_FAULT_NOPAGE; >>>>>          /* Re-check under ptl */ >>>>> -    if (likely(!vmf_pte_changed(vmf))) { >>>>> -        struct folio *folio = page_folio(page); >>>>> - >>>>> -        set_pte_range(vmf, folio, page, 1, vmf->address); >>>>> -        ret = 0; >>>>> -    } else { >>>>> +    if (nr_pages == 1 && vmf_pte_changed(vmf)) { >>>>>            update_mmu_tlb(vma, vmf->address, vmf->pte); >>>>>            ret = VM_FAULT_NOPAGE; >>>>> +        goto unlock; >>>>> +    } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) { >>>> >>>> I think you have grabbed this from do_anonymous_page()? But I'm not sure it >>>> works in the same way here as it does there. For the anon case, if userfaultfd >>>> is armed, alloc_anon_folio() will only ever allocate order-0. So we end up in >>> >>> IMO, the userfaultfd validation should do in the vma->vm_ops->fault() callback, >>> to make sure the nr_pages is always 1 if userfaultfd is armed. >> >> OK. Are you saying there is already logic to do that today? Great! > > I mean I should add the userfaultfd validation in shmem_fault(), and may be need > add a warning in finish_fault() to catch this issue if other > vma->vm_ops->fault() will support large folio allocation? > > WARN_ON(nr_pages > 1 && userfaultfd_armed(vma)); That adds quite a subtle requirement to vm_ops::fault() though, which I guess is implemented in a lot of places. It would be better if it could be handled centrally - i.e. that all the ptes are either none or a uffd marker? I'm sure there would be corner cases to think about if taking that route.