From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DFAA5E94133 for ; Sat, 7 Oct 2023 04:26:09 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 02FAF80023; Sat, 7 Oct 2023 00:26:09 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id F216F80008; Sat, 7 Oct 2023 00:26:08 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DEA0D80023; Sat, 7 Oct 2023 00:26:08 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id CDC0E80008 for ; Sat, 7 Oct 2023 00:26:08 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id 8C192120259 for ; Sat, 7 Oct 2023 04:26:08 +0000 (UTC) X-FDA: 81317378016.08.ED6A77D Received: from mail-oi1-f176.google.com (mail-oi1-f176.google.com [209.85.167.176]) by imf30.hostedemail.com (Postfix) with ESMTP id AAEA180002 for ; Sat, 7 Oct 2023 04:26:05 +0000 (UTC) Authentication-Results: imf30.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=I7ahZveV; spf=pass (imf30.hostedemail.com: domain of zhangpeng.00@bytedance.com designates 209.85.167.176 as permitted sender) smtp.mailfrom=zhangpeng.00@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1696652766; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Rslvp6AitoH4rS23dFyU/ODj+5z33u/RNEcErDsfMPw=; b=oc1zm1V9USOVtGawNE7lP5vkB9szCpF6o7Oibi4TDv4TJ4EM/8hcNM/1Gd7Y/89W34buPf h7QThojFdHyc1RjS/4JOlwqMChWxRpKvwt8OiVFGKAHu60ifwuypktd7U+/YT5klAHou1+ J5ZtIYUVHyQyY0+cCHrzNdOTJRyC77Q= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1696652766; a=rsa-sha256; cv=none; b=YvUh98/eIvHYHtWDdXsVflprzTWPCHyqYQAcc/JuaC0AAg9qjMx+ZmRvO0USpXydmIz3Ia Hzyj0nJRlFu2ugdMosbmX5A0AtnJVfOpOi7ht24rY1la8rWlyS7x15AAr/ILeUk92PFW7u ujD00SmZ7GagpSwdjw4E6EA/pVPWp04= ARC-Authentication-Results: i=1; imf30.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=I7ahZveV; spf=pass (imf30.hostedemail.com: domain of zhangpeng.00@bytedance.com designates 209.85.167.176 as permitted sender) smtp.mailfrom=zhangpeng.00@bytedance.com; dmarc=pass (policy=quarantine) header.from=bytedance.com Received: by mail-oi1-f176.google.com with SMTP id 5614622812f47-3ae5ee80c0dso1793684b6e.3 for ; Fri, 06 Oct 2023 21:26:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1696652764; x=1697257564; darn=kvack.org; h=in-reply-to:from:cc:references:to:subject:user-agent:mime-version :date:message-id:from:to:cc:subject:date:message-id:reply-to; bh=Rslvp6AitoH4rS23dFyU/ODj+5z33u/RNEcErDsfMPw=; b=I7ahZveV8pWJcKPy3/bLUW9WV2H3QXCA9zGNs+ipYanclIiLLsJSqVR34R4a72Qy45 Jjk5zyVv3AgXBzGg6c439LgqHn4NDvKC7oJQXMokIURzJKakwMdUdmdLBFcCoQpXWoZH 8TKmd5Lwyez/Z4NSCB7+LTnSGReNkMYtgarBaNjcCKlaLg9/A05MNM9mJbtwQUy28f6A PSXFinfTAHRZm6mTtXHB0ECvDixSxHneVlT0sqWy1otWNp2vCSIG2yaqtxjeYfzITvR7 9jBKiQHc6oQ2wZi9tnerxhA59seTvpwo9fYPTaTgaWU680QIPBGqtYdfQevmdh1rvGKd SmbQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696652764; x=1697257564; h=in-reply-to:from:cc:references:to:subject:user-agent:mime-version :date:message-id:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=Rslvp6AitoH4rS23dFyU/ODj+5z33u/RNEcErDsfMPw=; b=Sg0Fpx83vONiNnaeG4h0YzpZe1UmQE/3ZWD5NUaw5LYJbAXsWhDkWx4JvXHnE2lo1a UweZclDI6oJtStdcC8aTIK5sF0t6HtNNkexgYf7tgOHOF0yNKtydkkEaa57nlmC3YZtr d5Oqp7Rsk22ob4XaH+0NMqZtTrzd8Xs2yAN41v16+S/fW96VWQXz6SMfrGoO5zNU3yLF U41TAZjdfHoXGo3IV058mCUVOW4IS1wkLjgT8biOL6C0k6nGCIIcbAESgRErj4g0Al6w lT+n7SeYjyfndTwxQjKsP3X8EARWOOrpnr93dEBIUMC3Ws7iFDKp9iCmwGw2k1HLfSMd 8U0Q== X-Gm-Message-State: AOJu0YyChHksjwl06zeAG+cAaqDSK9UZDGPC2pNqWNQS84MM0qAv+1bi Hc8vrJWYhx1YJm4K67o0GSygpw== X-Google-Smtp-Source: AGHT+IHH7Dy75gvrykHzeQf6CZs8dmhESWZVwzioYdWgJ8hBUOo2/Yy34HZgakNb6jz5tW8n9Nqa2A== X-Received: by 2002:a05:6358:7f1c:b0:143:96ac:96da with SMTP id p28-20020a0563587f1c00b0014396ac96damr7391613rwn.2.1696652764294; Fri, 06 Oct 2023 21:26:04 -0700 (PDT) Received: from [10.254.225.239] ([139.177.225.225]) by smtp.gmail.com with ESMTPSA id mt12-20020a17090b230c00b0027b168cb02asm5035336pjb.9.2023.10.06.21.25.57 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 06 Oct 2023 21:26:03 -0700 (PDT) Content-Type: multipart/mixed; boundary="------------0ZF2aWYZwteDjnDjJjSqArOh" Message-ID: <9d347c17-55dd-a772-4d82-5d18b1206bc4@bytedance.com> Date: Sat, 7 Oct 2023 12:25:55 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.15.1 Subject: Re: [PATCH v3 9/9] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() To: "Liam R. Howlett" References: <20230925035617.84767-1-zhangpeng.00@bytedance.com> <20230925035617.84767-10-zhangpeng.00@bytedance.com> <20231003184634.bbb5c5ezkvi6tkdv@revolver> <58ec7a15-6983-d199-bc1a-6161c3b75e0f@bytedance.com> <20231004195347.yggeosopqwb6ftos@revolver> <785511a6-8636-04e5-c002-907443b34dad@bytedance.com> <20231007011102.koplouxuumlog3cu@revolver> <20231007013231.ctzjap6uzvutuant@revolver> Cc: Peng Zhang , corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org, brauner@kernel.org, surenb@google.com, michael.christie@oracle.com, mjguzik@gmail.com, mathieu.desnoyers@efficios.com, npiggin@gmail.com, peterz@infradead.org, oliver.sang@intel.com, maple-tree@lists.infradead.org, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org From: Peng Zhang In-Reply-To: <20231007013231.ctzjap6uzvutuant@revolver> X-Stat-Signature: tg5q687t1a3pba7h8qdgw4k9smgjnpu3 X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: AAEA180002 X-Rspam-User: X-HE-Tag: 1696652765-250365 X-HE-Meta: U2FsdGVkX19sY2UMj9qmDll5ODQ9URmwRnpo0Kz/dkuxu0hCi77mYEa32j4i2WFDcR4R4jcVa6PCwlviRy47dlZ7pUIJPuoIS50Sn90HVAkS6qbqQcTnt1ad7VSthjOwzFfJsdpWiBMkHlqb9bn1rzGnXJ5aZ4S5THbgZ/e2CwtKGNGwM0NvNk4I26z8oR86Pb56PuBHhPzqMjLjzttP/WVuaLyUxy5LlGFa0X0pzpLSztYXX7mGGpeIV+ZX7zOLpUgasl8sMRuhecdakAA/K7ArHuI6YrS+ckz+udf+lOvNuS+IK2e0tGCb+PqVdTqjL3MhNk97Q1fY0F/F+Adb4m4al7BBjcw+cEaxWH2ZPcfXD2q88Edu1g2pCsZBnZlNqU1BZf95XXPPSYf0FIJSFPVTnr/+FSVHurcjUkUYB2U2ra4NEEq09i5fDBX1Tyh8x4VmQSjaaTlMf0f/rMcDHx3hsJT4p44u0FkXtJztVAOEInTnKM2GbqsMWbA4iQ1eXNlhoEkjLoyLgGr1g38pPkjcKC6MD7DE9i4WjejF1xePVUFGfGRYLQi4QL4HIXdOHDLCVs0vtlXIWumfYjpMYkTC0W4HE7zkEKmdHfWMnVfGs0ovyrkNJe+AR/U6C5bdIKGjaTwQb3kn3F0NJZ/VubS8mVoawlbxCd/5HpugXueqhca3duw+5aFQPj3mxtUG0Ei1oWKyX0DNimcqkslB2IKcfP9Jy721rkareo6wnWtqAjoY4m7EXI0AdaA0BKV+OIYHmJr+nMgUOaJVZfIDoxq3kclmp5/pAhDBqvJnQNfqHAkuvrwzOfo/I21abcgD85gVz6axxMvXR2C+tlkiobsmgp9CWmJKba+eRvCF5rj8w5YHcADbRkXDfc8pZ1PTLX69XRz/CW6ZYMs3LRKBn4u1UoIJ6MVTElGQx1a5pmAd9i98xkoTAT2aERn9T1yTCLMpO2cJMwY6DwT93r5 A9Nxv6MD QN7Q5micGZ0BW1rW4Eocg7kwoWKQ4lCNxx0yCGUUtvfftwEhJPavX/kZUWR3YeEZCiUQbQ4cR8Yy+WKUsjr6tkB+Yx5smqIR05TazZFL8zqfasstvyR0p/8AXrLIogh4qTtxZ47NNQj2TJtui7XAeKRocR4lnLDtnraXKiiZavt7pK03uxQ+SYT9ECk2VwHpqt88oncIQXk9ADOE8CuzZFW0pw4+9ratiDuvbISJ9D4pPxWz2dvuftR1F9D01YDUhVCYBOp2TbQxku/dyMVbG7+ooFNbVRWASF05+gwoNzZQLhnN3uZPmRO7DmgxUeDhnZMoiFqqHLjCjvWYkIrLRq9c7Ot2/B94ziPlkjjsbD52zbL5fRyXLk9Bz9VwklJx7YQK/4/4HDfqJaREslFuNG9cDxspScLAHeG+JRSVi/0hZkcbDjukDq+iT5xou+qUg//R5/4rE/djQXWtH0cVwoZMOW/5VD4XGYKcfRxTYSVrbwwlTpkDEdquJrRknZRy8RLUcu3CAftv3fY2c9mRfWjWhdis97CSm/qdQ5wW70Az+QPs25/VyzzP0cahVUYqBiML4vvYhGdr2aoHQOfifHxYHegiBVUqjUtt0OtwbePkAf9/lS7KkYB62Th+Av/2SyvcCi+vtVwxzX/I/KddbtUjGPwAPTIZgAzC5YmzvBRcuyytwnKPsgn3sWCXPKG93LBtVoj82Z+fKpCkOgmxUuCIyuQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This is a multi-part message in MIME format. --------------0ZF2aWYZwteDjnDjJjSqArOh Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit 在 2023/10/7 09:32, Liam R. Howlett 写道: ... >>>> >>>>>>> >>>>>>> [1] https://github.com/kdlucas/byte-unixbench/tree/master >>>>>>> >>>>>>> Signed-off-by: Peng Zhang >>>>>>> --- >>>>>>> include/linux/mm.h | 1 + >>>>>>> kernel/fork.c | 34 ++++++++++++++++++++---------- >>>>>>> mm/internal.h | 3 ++- >>>>>>> mm/memory.c | 7 ++++--- >>>>>>> mm/mmap.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-- >>>>>>> 5 files changed, 80 insertions(+), 17 deletions(-) >>>>>>> >>>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>>>>>> index 1f1d0d6b8f20..10c59dc7ffaa 100644 >>>>>>> --- a/include/linux/mm.h >>>>>>> +++ b/include/linux/mm.h >>>>>>> @@ -3242,6 +3242,7 @@ extern void unlink_file_vma(struct vm_area_struct *); >>>>>>> extern struct vm_area_struct *copy_vma(struct vm_area_struct **, >>>>>>> unsigned long addr, unsigned long len, pgoff_t pgoff, >>>>>>> bool *need_rmap_locks); >>>>>>> +extern void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end); >>>>>>> extern void exit_mmap(struct mm_struct *); >>>>>>> static inline int check_data_rlimit(unsigned long rlim, >>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c >>>>>>> index 7ae36c2e7290..2f3d83e89fe6 100644 >>>>>>> --- a/kernel/fork.c >>>>>>> +++ b/kernel/fork.c >>>>>>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>>>>> int retval; >>>>>>> unsigned long charge = 0; >>>>>>> LIST_HEAD(uf); >>>>>>> - VMA_ITERATOR(old_vmi, oldmm, 0); >>>>>>> VMA_ITERATOR(vmi, mm, 0); >>>>>>> uprobe_start_dup_mmap(); >>>>>>> @@ -678,16 +677,25 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>>>>> goto out; >>>>>>> khugepaged_fork(mm, oldmm); >>>>>>> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count); >>>>>>> - if (retval) >>>>>>> + /* Use __mt_dup() to efficiently build an identical maple tree. */ >>>>>>> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL); >>>>>>> + if (unlikely(retval)) >>>>>>> goto out; >>>>>>> mt_clear_in_rcu(vmi.mas.tree); >>>>>>> - for_each_vma(old_vmi, mpnt) { >>>>>>> + for_each_vma(vmi, mpnt) { >>>>>>> struct file *file; >>>>>>> vma_start_write(mpnt); >>>>>>> if (mpnt->vm_flags & VM_DONTCOPY) { >>>>>>> + mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL); >>>>>>> + >>>>>>> + /* If failed, undo all completed duplications. */ >>>>>>> + if (unlikely(mas_is_err(&vmi.mas))) { >>>>>>> + retval = xa_err(vmi.mas.node); >>>>>>> + goto loop_out; >>>>>>> + } >>>>>>> + >>>>>>> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); >>>>>>> continue; >>>>>>> } >>>>>>> @@ -749,9 +757,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>>>>> if (is_vm_hugetlb_page(tmp)) >>>>>>> hugetlb_dup_vma_private(tmp); >>>>>>> - /* Link the vma into the MT */ >>>>>>> - if (vma_iter_bulk_store(&vmi, tmp)) >>>>>>> - goto fail_nomem_vmi_store; >>>>>>> + /* >>>>>>> + * Link the vma into the MT. After using __mt_dup(), memory >>>>>>> + * allocation is not necessary here, so it cannot fail. >>>>>>> + */ >>>>>>> + mas_store(&vmi.mas, tmp); >>>>>>> mm->map_count++; >>>>>>> if (!(tmp->vm_flags & VM_WIPEONFORK)) >>>>>>> @@ -760,15 +770,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>>>>> if (tmp->vm_ops && tmp->vm_ops->open) >>>>>>> tmp->vm_ops->open(tmp); >>>>>>> - if (retval) >>>>>>> + if (retval) { >>>>>>> + mpnt = vma_next(&vmi); >>>>>>> goto loop_out; >>>>>>> + } >>>>>>> } >>>>>>> /* a new mm has just been created */ >>>>>>> retval = arch_dup_mmap(oldmm, mm); >>>>>>> loop_out: >>>>>>> vma_iter_free(&vmi); >>>>>>> - if (!retval) >>>>>>> + if (likely(!retval)) >>>>>>> mt_set_in_rcu(vmi.mas.tree); >>>>>>> + else >>>>>>> + undo_dup_mmap(mm, mpnt); >>>>>>> out: >>>>>>> mmap_write_unlock(mm); >>>>>>> flush_tlb_mm(oldmm); >>>>>>> @@ -778,8 +792,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >>>>>>> uprobe_end_dup_mmap(); >>>>>>> return retval; >>>>>>> -fail_nomem_vmi_store: >>>>>>> - unlink_anon_vmas(tmp); >>>>>>> fail_nomem_anon_vma_fork: >>>>>>> mpol_put(vma_policy(tmp)); >>>>>>> fail_nomem_policy: >>>>>>> diff --git a/mm/internal.h b/mm/internal.h >>>>>>> index 7a961d12b088..288ec81770cb 100644 >>>>>>> --- a/mm/internal.h >>>>>>> +++ b/mm/internal.h >>>>>>> @@ -111,7 +111,8 @@ void folio_activate(struct folio *folio); >>>>>>> void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, >>>>>>> struct vm_area_struct *start_vma, unsigned long floor, >>>>>>> - unsigned long ceiling, bool mm_wr_locked); >>>>>>> + unsigned long ceiling, unsigned long tree_end, >>>>>>> + bool mm_wr_locked); >>>>>>> void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); >>>>>>> struct zap_details; >>>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>>> index 983a40f8ee62..1fd66a0d5838 100644 >>>>>>> --- a/mm/memory.c >>>>>>> +++ b/mm/memory.c >>>>>>> @@ -362,7 +362,8 @@ void free_pgd_range(struct mmu_gather *tlb, >>>>>>> void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, >>>>>>> struct vm_area_struct *vma, unsigned long floor, >>>>>>> - unsigned long ceiling, bool mm_wr_locked) >>>>>>> + unsigned long ceiling, unsigned long tree_end, >>>>>>> + bool mm_wr_locked) >>>>>>> { >>>>>>> do { >>>>>>> unsigned long addr = vma->vm_start; >>>>>>> @@ -372,7 +373,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, >>>>>>> * Note: USER_PGTABLES_CEILING may be passed as ceiling and may >>>>>>> * be 0. This will underflow and is okay. >>>>>>> */ >>>>>>> - next = mas_find(mas, ceiling - 1); >>>>>>> + next = mas_find(mas, tree_end - 1); >>>>>>> /* >>>>>>> * Hide vma from rmap and truncate_pagecache before freeing >>>>>>> @@ -393,7 +394,7 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas, >>>>>>> while (next && next->vm_start <= vma->vm_end + PMD_SIZE >>>>>>> && !is_vm_hugetlb_page(next)) { >>>>>>> vma = next; >>>>>>> - next = mas_find(mas, ceiling - 1); >>>>>>> + next = mas_find(mas, tree_end - 1); >>>>>>> if (mm_wr_locked) >>>>>>> vma_start_write(vma); >>>>>>> unlink_anon_vmas(vma); >>>>>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>>>>> index 2ad950f773e4..daed3b423124 100644 >>>>>>> --- a/mm/mmap.c >>>>>>> +++ b/mm/mmap.c >>>>>>> @@ -2312,7 +2312,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas, >>>>>>> mas_set(mas, mt_start); >>>>>>> free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, >>>>>>> next ? next->vm_start : USER_PGTABLES_CEILING, >>>>>>> - mm_wr_locked); >>>>>>> + tree_end, mm_wr_locked); >>>>>>> tlb_finish_mmu(&tlb); >>>>>>> } >>>>>>> @@ -3178,6 +3178,54 @@ int vm_brk(unsigned long addr, unsigned long len) >>>>>>> } >>>>>>> EXPORT_SYMBOL(vm_brk); >>>>>>> +void undo_dup_mmap(struct mm_struct *mm, struct vm_area_struct *vma_end) >>>>>>> +{ >>>>>>> + unsigned long tree_end; >>>>>>> + VMA_ITERATOR(vmi, mm, 0); >>>>>>> + struct vm_area_struct *vma; >>>>>>> + unsigned long nr_accounted = 0; >>>>>>> + int count = 0; >>>>>>> + >>>>>>> + /* >>>>>>> + * vma_end points to the first VMA that has not been duplicated. We need >>>>>>> + * to unmap all VMAs before it. >>>>>>> + * If vma_end is NULL, it means that all VMAs in the maple tree have >>>>>>> + * been duplicated, so setting tree_end to 0 will overflow to ULONG_MAX >>>>>>> + * when using it. >>>>>>> + */ >>>>>>> + if (vma_end) { >>>>>>> + tree_end = vma_end->vm_start; >>>>>>> + if (tree_end == 0) >>>>>>> + goto destroy; >>>>>>> + } else >>>>>>> + tree_end = 0; >> >> You need to enclose this statement to meet the coding style. You could >> just set tree_end = 0 at the start of the function instead, actually I >> think tree_end = USER_PGTABLES_CEILING unless there is a vma_end. >> >>>>>>> + >>>>>>> + vma = mas_find(&vmi.mas, tree_end - 1); >> >> vma = vma_find(&vmi, tree_end); >> >>>>>>> + >>>>>>> + if (vma) { >> >> Probably would be cleaner to jump to destroy here too: >> if (!vma) >> goto destroy; >> >>>>>>> + arch_unmap(mm, vma->vm_start, tree_end); > > One more thing, it seems the maple state that is passed into > unmap_region() needs to point to the _next_ element, or the reset > doesn't work right between the unmap_vmas() and free_pgtables() call: > > vma_iter_set(&vmi, vma->vm_end); > > >>>>>>> + unmap_region(mm, &vmi.mas, vma, NULL, NULL, 0, tree_end, >>>>>>> + tree_end, true); >>>>>> >>>>>> next is vma_end, as per your comment above. Using next = vma_end allows >>>>>> you to avoid adding another argument to free_pgtables(). >>>>> Unfortunately, it cannot be done this way. I fell into this trap before, >>>>> and it caused incomplete page table cleanup. To solve this problem, the >>>>> only solution I can think of right now is to add an additional >>>>> parameter. >>>>> >>>>> free_pgtables() will be called in unmap_region() to free the page table, >>>>> like this: >>>>> >>>>> free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, >>>>> next ? next->vm_start : USER_PGTABLES_CEILING, >>>>> mm_wr_locked); >>>>> >>>>> The problem is with 'next'. Our 'vma_end' does not exist in the actual >>>>> mmap because it has not been duplicated and cannot be used as 'next'. >>>>> If there is a real 'next', we can use 'next->vm_start' as the ceiling, >>>>> which is not a problem. If there is no 'next' (next is 'vma_end'), we >>>>> can only use 'USER_PGTABLES_CEILING' as the ceiling. Using >>>>> 'vma_end->vm_start' as the ceiling will cause the page table not to be >>>>> fully freed, which may be related to alignment in 'free_pgd_range()'. To >>>>> solve this problem, we have to introduce 'tree_end', and separating >>>>> 'tree_end' and 'ceiling' can solve this problem. >>>> >>>> Can you just use ceiling? That is, just not pass in next and keep the >>>> code as-is? This is how exit_mmap() does it and should avoid any >>>> alignment issues. I assume you tried that and something went wrong as >>>> well? >>> I tried that, but it didn't work either. In free_pgtables(), the >>> following line of code is used to iterate over VMAs: >>> mas_find(mas, ceiling - 1); >>> If next is passed as NULL, ceiling will be 0, resulting in iterating >>> over all the VMAs in the maple tree, including the last portion that was >>> not duplicated. >> >> If vma_end is NULL, it means that all VMAs in the maple tree have been >> duplicated, so shouldn't the correct action in this case be freeing up >> to ceiling? Yes, that's correct. >> >> If it isn't null, then vma_end->vm_start should work as the end of the >> area to free. But there's an issue here. I initially thought the same way, but the behavior of free_pgtables() is very strange. For the last VMA, it seems that the ceiling passed to free_pgd_range() must be USER_PGTABLES_CEILING. It cannot be used vma_end->vm_start as the ceiling, possibly due to the peculiar alignment behavior in free_pgd_range(). The code is from free_pgd_range(): if (ceiling) { ceiling &= PMD_MASK; if (!ceiling) return; } I suspect it is related to this part. The behavior differs when the ceiling is equal to 0 or non-zero. However, I cannot comprehend all the details here. >> >> With your mas_find(mas, tree_end - 1), then the vma_end will be avoided, >> but free_pgd_range() will use ceiling anyways: >> >> free_pgd_range(tlb, addr, vma->vm_end, floor, next ? next->vm_start : ceiling); >> >> Passing in vma_end as next to unmap_region() functions in my testing >> without adding arguments to free_pgtables(). >> >> How are you producing the accounting issue you mention above? Maybe I >> missed something? You can apply the patch provided at the bottom, and then use the test program in Attachment 1 to reproduce the issue. In dmesg, the kernel will report the following error: [ 14.829561] BUG: non-zero pgtables_bytes on freeing mm: 12288 [ 14.832445] BUG: non-zero pgtables_bytes on freeing mm: 12288 diff --git a/kernel/fork.c b/kernel/fork.c index 5f24f6d68ea4..fcc66acac480 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -688,7 +688,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, vma_start_write(mpnt); if (mpnt->vm_flags & VM_DONTCOPY) { - mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL); + if (!strcmp(current->comm, "fork_test") && ktime_get_ns() % 2) { + vmi.mas.node = MA_ERROR(-ENOMEM); + } else { + mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL); + } /* If failed, undo all completed duplications. */ if (unlikely(mas_is_err(&vmi.mas))) { >> >> >>>> >>>>> >>>>>> >>>>>>> + >>>>>>> + mas_set(&vmi.mas, vma->vm_end); >> vma_iter_set(&vmi, vma->vm_end); >>>>>>> + do { >>>>>>> + if (vma->vm_flags & VM_ACCOUNT) >>>>>>> + nr_accounted += vma_pages(vma); >>>>>>> + remove_vma(vma, true); >>>>>>> + count++; >>>>>>> + cond_resched(); >>>>>>> + vma = mas_find(&vmi.mas, tree_end - 1); >>>>>>> + } while (vma != NULL); >> >> You can write this as: >> do { ... } for_each_vma_range(vmi, vma, tree_end); >> >>>>>>> + >>>>>>> + BUG_ON(count != mm->map_count); >>>>>>> + >>>>>>> + vm_unacct_memory(nr_accounted); >>>>>>> + } >>>>>>> + >>>>>>> +destroy: >>>>>>> + __mt_destroy(&mm->mm_mt); >>>>>>> +} >>>>>>> + >>>>>>> /* Release all mmaps. */ >>>>>>> void exit_mmap(struct mm_struct *mm) >>>>>>> { >>>>>>> @@ -3217,7 +3265,7 @@ void exit_mmap(struct mm_struct *mm) >>>>>>> mt_clear_in_rcu(&mm->mm_mt); >>>>>>> mas_set(&mas, vma->vm_end); >>>>>>> free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS, >>>>>>> - USER_PGTABLES_CEILING, true); >>>>>>> + USER_PGTABLES_CEILING, USER_PGTABLES_CEILING, true); >>>>>>> tlb_finish_mmu(&tlb); >>>>>>> /* >>>>>>> -- >>>>>>> 2.20.1 >>>>>>> >>>>>> >>>>> >>>> > --------------0ZF2aWYZwteDjnDjJjSqArOh Content-Type: text/plain; charset=UTF-8; name="fork_test.c" Content-Disposition: attachment; filename="fork_test.c" Content-Transfer-Encoding: base64 I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPHN5cy9t bWFuLmg+CiNpbmNsdWRlIDx1bmlzdGQuaD4KI2luY2x1ZGUgPHN5cy93YWl0Lmg+CgppbnQg bWFpbigpCnsKCWludCBjbnRfc3VjY2VzcyA9IDAsIGNudF9mYWlsdXJlID0gMDsKCWludCBz dGF0dXM7CgoJdm9pZCAqYWRkciA9IG1tYXAoTlVMTCwgNDA5NiwgUFJPVF9SRUFEIHwgUFJP VF9XUklURSwKCQkJICBNQVBfUFJJVkFURSB8IE1BUF9BTk9OWU1PVVMsIC0xLCAwKTsKCWlm IChhZGRyID09IE1BUF9GQUlMRUQpIHsKCQlwZXJyb3IoIm1tYXAgZmFpbGVkIik7CgkJZXhp dCgxKTsKCX0KCWlmIChtcHJvdGVjdChhZGRyLCA0MDk2LCBQUk9UX1JFQUQgfCBQUk9UX1dS SVRFIHwgUFJPVF9FWEVDKSA9PSAtMSkgewoJCXBlcnJvcigibXByb3RlY3QgZmFpbGVkIik7 CgkJZXhpdCgxKTsKCX0KCWlmIChtYWR2aXNlKGFkZHIsIDQwOTYsIE1BRFZfRE9OVEZPUksp ID09IC0xKSB7CgkJcGVycm9yKCJtYWR2aXNlIGZhaWxlZCIpOwoJCWV4aXQoMSk7Cgl9Cglw cmludGYoIlZNQSBjcmVhdGVkIGF0IGFkZHJlc3MgJXBcbiIsIGFkZHIpOwoKCWZvciAoaW50 IGkgPSAwOyBpIDwgMTAwMDA7IGkrKykgewoJCXBpZF90IHBpZCA9IGZvcmsoKTsKCQlpZiAo cGlkID09IC0xKSB7CgkJCWNudF9mYWlsdXJlKys7CgkJfSBlbHNlIGlmIChwaWQgPT0gMCkg ewoJCQlleGl0KEVYSVRfU1VDQ0VTUyk7CgkJfSBlbHNlIHsKCQkJY250X3N1Y2Nlc3MrKzsK CQkJd2FpdCgmc3RhdHVzKTsKCQkJaWYgKHN0YXR1cyAhPSAwKSB7CgkJCQlmcHJpbnRmKHN0 ZGVyciwgIkJhZCB3YWl0IHN0YXR1czogMHgleFxuIiwKCQkJCQlzdGF0dXMpOwoJCQkJZXhp dCgyKTsKCQkJfQoJCX0KCX0KCglwcmludGYoInN1Y2Nlc3M6JWQgZmFpbHVyZTolZFxuIiwg Y250X3N1Y2Nlc3MsIGNudF9mYWlsdXJlKTsKCXJldHVybiAwOwp9Cg== --------------0ZF2aWYZwteDjnDjJjSqArOh--