From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 9 May 2024 09:10:36 +0800
Subject: Re: [PATCH 2/8] mm: memory: extend finish_fault() to support large folio
From: Baolin Wang
To: Ryan Roberts, akpm@linux-foundation.org, hughd@google.com
Cc: willy@infradead.org, david@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ying.huang@intel.com, 21cnbao@gmail.com,
 shy828301@gmail.com, ziy@nvidia.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <13939ade-a99a-4075-8a26-9be7576b7e03@arm.com>
 <3d87da24-7887-4912-abcf-14062e8514de@linux.alibaba.com>
 <900579ab-ea0c-4ce1-ad33-4f81827081d4@arm.com>
In-Reply-To: <900579ab-ea0c-4ce1-ad33-4f81827081d4@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2024/5/8 18:47, Ryan Roberts wrote:
> On 08/05/2024 10:31, Baolin Wang wrote:
>>
>>
>> On 2024/5/8 16:53, Ryan Roberts wrote:
>>> On 08/05/2024 04:44, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/5/7 18:37, Ryan Roberts wrote:
>>>>> On 06/05/2024 09:46, Baolin Wang wrote:
>>>>>> Add large folio mapping establishment support for finish_fault() as a
>>>>>> preparation, to support multi-size THP allocation of anonymous shmem
>>>>>> pages in the following patches.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang
>>>>>> ---
>>>>>>   mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
>>>>>>   1 file changed, 33 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index eea6e4984eae..936377220b77 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4747,9 +4747,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>  {
>>>>>>      struct vm_area_struct *vma = vmf->vma;
>>>>>>      struct page *page;
>>>>>> +    struct folio *folio;
>>>>>>      vm_fault_t ret;
>>>>>>      bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>>>>                !(vma->vm_flags & VM_SHARED);
>>>>>> +    int type, nr_pages, i;
>>>>>> +    unsigned long addr = vmf->address;
>>>>>>
>>>>>>      /* Did we COW the page? */
>>>>>>      if (is_cow)
>>>>>> @@ -4780,24 +4783,44 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>>              return VM_FAULT_OOM;
>>>>>>      }
>>>>>>
>>>>>> +    folio = page_folio(page);
>>>>>> +    nr_pages = folio_nr_pages(folio);
>>>>>> +
>>>>>> +    if (unlikely(userfaultfd_armed(vma))) {
>>>>>> +        nr_pages = 1;
>>>>>> +    } else if (nr_pages > 1) {
>>>>>> +        unsigned long start = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>>>>>> +        unsigned long end = start + nr_pages * PAGE_SIZE;
>>>>>> +
>>>>>> +        /* In case the folio size in page cache beyond the VMA limits. */
>>>>>> +        addr = max(start, vma->vm_start);
>>>>>> +        nr_pages = (min(end, vma->vm_end) - addr) >> PAGE_SHIFT;
>>>>>> +
>>>>>> +        page = folio_page(folio, (addr - start) >> PAGE_SHIFT);
>>>>>
>>>>> I still don't really follow the logic in this else if block. Isn't it
>>>>> possible that finish_fault() gets called with a page from a folio that
>>>>> isn't aligned with vmf->address?
>>>>>
>>>>> For example, let's say we have a file whose size is 64K and which is
>>>>> cached in a single large folio in the page cache. But the file is
>>>>> mapped into a process at VA 16K to 80K. Let's say we fault on the first
>>>>> page (VA=16K). You will calculate
>>>>
>>>> For shmem, this doesn't happen because the VA is aligned with the
>>>> hugepage size in the shmem_get_unmapped_area() function. See patch 7.
>>>
>>> Certainly agree that shmem can always make sure that it packs a vma in a
>>> way such that its folios are naturally aligned in VA when faulting in
>>> memory. If you mremap it, that alignment will be lost; I don't think that
>>> would be a problem
>>
>> When you mremap it, it will also call shmem_get_unmapped_area() to align
>> the VA, but for mremap() with the MAP_FIXED flag, as David pointed out,
>> yes, this patch may not work perfectly.
>
> Assuming this works similarly to anon mTHP, remapping to an arbitrary
> address shouldn't be a problem within a single process; the previously
> allocated folios will now be unaligned, but they will be correctly mapped
> so it doesn't break anything. And new faults will allocate folios so that
> they are as large as allowed by the sysfs interface AND which do not
> overlap with any non-none pte AND which are naturally aligned. It's when
> you start sharing with other processes that the fun and games start...
>
>>
>>> for a single process; mremap will take care of moving the ptes correctly
>>> and this path is not involved.
>>>
>>> But what about the case when a process mmaps a shmem region, then forks,
>>> then the child mremaps the shmem region. Then the parent faults in a THP
>>> into the region (nicely aligned). Then the child faults in the same
>>> offset in the region and gets the THP that the parent allocated; that
>>> THP will be aligned in the parent's VM space but not in the child's.
>>
>> Sorry, I did not get your point here. IIUC, the child's VA will also be
>> aligned if the child's mremap does not set MAP_FIXED, since the child's
>> mremap will still call shmem_get_unmapped_area() to find an aligned new
>> VA.
>
> In general, you shouldn't be relying on the vma bounds being aligned to a
> THP boundary.
>
>> Please correct me if I missed your point.
>
> (I'm not 100% sure this is definitely how it works, but it seems the only
> sane way to me):
>
> Let's imagine we have a process that maps 4 pages of shared anon memory
> at VA=64K:
>
> mmap(64K, 16K, PROT_X, MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, ...)
>
> Then it forks a child, and the child moves the mapping to VA=68K:
>
> mremap(64K, 16K, 16K, MREMAP_FIXED | MREMAP_MAYMOVE, 68K)
>
> Then the parent writes to address 64K (offset 0 in the shared region);
> this will fault and cause a 16K mTHP to be allocated and mapped, covering
> the whole region at 64K-80K in the parent.
>
> Then the child reads address 68K (offset 0 in the shared region); this
> will fault and cause the previously allocated 16K folio to be looked up,
> and it must be mapped in the child between 68K-84K. This is not naturally
> aligned in the child.
>
> For the child, your code will incorrectly calculate start/end as 64K-80K.

OK, so you set the MREMAP_FIXED flag, just as David pointed out. Yes, it
will not be aligned in the child for this case. Thanks for the explanation.