From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 3 Jun 2024 16:29:29 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, hughd@google.com, willy@infradead.org,
 david@redhat.com, wangkefeng.wang@huawei.com, ying.huang@intel.com,
 ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
 ioworker0@gmail.com, da.gomez@samsung.com, p.raghav@samsung.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/6/3 13:28, Barry Song wrote:
> On Thu, May 30, 2024 at 2:04 PM Baolin Wang
> wrote:
>>
>> Add large folio mapping establishment support for finish_fault() as a preparation,
>> to support multi-size THP allocation of anonymous shmem pages in the following
>> patches.
>>
>> Signed-off-by: Baolin Wang
>> ---
>>  mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 48 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eef4e482c0c2..435187ff7ea4 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>  {
>>         struct vm_area_struct *vma = vmf->vma;
>>         struct page *page;
>> +       struct folio *folio;
>>         vm_fault_t ret;
>>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>                       !(vma->vm_flags & VM_SHARED);
>> +       int type, nr_pages, i;
>> +       unsigned long addr = vmf->address;
>>
>>         /* Did we COW the page? */
>>         if (is_cow)
>> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>                 return VM_FAULT_OOM;
>>         }
>>
>> +       folio = page_folio(page);
>> +       nr_pages = folio_nr_pages(folio);
>> +
>> +       /*
>> +        * Using per-page fault to maintain the uffd semantics, and same
>> +        * approach also applies to non-anonymous-shmem faults to avoid
>> +        * inflating the RSS of the process.
>
> I don't feel the comment explains the root cause.
> For non-shmem, anyway we have allocated the memory? Avoiding inflating
> RSS seems not so useful as we have occupied the memory. the memory footprint

This is also to keep the same behavior as before for non-anon-shmem, and
will be discussed in the future.

> is what we really care about. so we want to rely on read-ahead hints of subpage
> to determine read-ahead size? that is why we don't map nr_pages for non-shmem
> files though we can potentially reduce nr_pages - 1 page faults?
IMHO, there are two cases for non-anon-shmem:
(1) read mmap() faults: we can rely on the 'fault_around_bytes' interface
to determine what size of mapping to build.
(2) writable mmap() faults: I want to keep the same behavior as before
(per-page fault), but we can talk about this when I send new patches to
use mTHP to control large folio allocation for writable mmap().

>> +        */
>> +       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>> +               nr_pages = 1;
>> +       } else if (nr_pages > 1) {
>> +               pgoff_t idx = folio_page_idx(folio, page);
>> +               /* The page offset of vmf->address within the VMA. */
>> +               pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
>> +
>> +               /*
>> +                * Fallback to per-page fault in case the folio size in page
>> +                * cache beyond the VMA limits.
>> +                */
>> +               if (unlikely(vma_off < idx ||
>> +                            vma_off + (nr_pages - idx) > vma_pages(vma))) {
>> +                       nr_pages = 1;
>> +               } else {
>> +                       /* Now we can set mappings for the whole large folio. */
>> +                       addr = vmf->address - idx * PAGE_SIZE;
>> +                       page = &folio->page;
>> +               }
>> +       }
>> +
>>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -                                      vmf->address, &vmf->ptl);
>> +                                      addr, &vmf->ptl);
>>         if (!vmf->pte)
>>                 return VM_FAULT_NOPAGE;
>>
>>         /* Re-check under ptl */
>> -       if (likely(!vmf_pte_changed(vmf))) {
>> -               struct folio *folio = page_folio(page);
>> -               int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
>> -
>> -               set_pte_range(vmf, folio, page, 1, vmf->address);
>> -               add_mm_counter(vma->vm_mm, type, 1);
>> -               ret = 0;
>> -       } else {
>> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
>> +       if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
>> +               update_mmu_tlb(vma, addr, vmf->pte);
>>                 ret = VM_FAULT_NOPAGE;
>> +               goto unlock;
>> +       } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
>
> In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> then unify the code for nr_pages==1 and nr_pages > 1?
>
> It seems this has been discussed before, but I forget the reason.
IIUC, this is for the uffd case, where the pte is not a none entry.