From: Peng Zhang <zhangpeng.00@bytedance.com>
Date: Fri, 15 Sep 2023 18:56:57 +0800
Subject: Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
Message-ID: <377b1e27-5752-ee04-b568-69b697852228@bytedance.com>
Howlett" , corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org, brauner@kernel.org, surenb@google.com, michael.christie@oracle.com, peterz@infradead.org, mathieu.desnoyers@efficios.com, npiggin@gmail.com, avagin@gmail.com, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org References: <20230830125654.21257-1-zhangpeng.00@bytedance.com> <20230830125654.21257-7-zhangpeng.00@bytedance.com> <20230907201414.dagnqxfnu7f7qzxd@revolver> <2868f2d5-1abe-7af6-4196-ee53cfae76a9@bytedance.com> From: Peng Zhang In-Reply-To: <2868f2d5-1abe-7af6-4196-ee53cfae76a9@bytedance.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 0244C180024 X-Rspam-User: X-Stat-Signature: dydhhei8a1qp4zwa7mijeya48nzbj8q4 X-Rspamd-Server: rspam01 X-HE-Tag: 1694775425-85882 X-HE-Meta: U2FsdGVkX1/xe4duemrwoJUdEHYyFhA5ukgfRK/kHODUiJ84i2ZAahdEA0Qt7HjQhRsNWiHtmVaMIoNJX56QdjdWBcM9tC3jczknT5XlbtzVPTShEBqFGPjAn9u7fjJtpEf7qXY80rrTEhb9SQT5mFPu/elAXZdfyUC8anXOjf+XKjELPxlWMGCnw81s9q0sVLbK7dbdhkVeAnXLhxvbIARbcnmtNoeom2TLkxAlebTx6xpMIBDRraFvDsfFyEMOI8YOCvWUjFSkVBm6ieJsVyr78H1hNBJwRonhtQFNLzIMfJX3t4FBWIEG5YoHHDY8IM0hUljxtWCW/u0JpjyuhDl+5ytbz48uaZazDgtndhs/+nnCTceFvWQPpV3RMxC0ouls0I9FlFV2RgSxk2tE52fioR/QQK1AIzg/Inuc9QlaFwkiVjdPjiU2MQf7QjvnDCS3GHqe4hSpLU6vyi+v4gNHk/KBl/VHUmTY5Gbbs+XvgN41KfpZFjFs+bGRi2iJmDeVhZvG0Yn8a7bnvmv4VgOe6GO7fCiyj9QGfj+HCC+sBZQFdEJHhrNcxo9Vtc8SKtKSprl+61mtyMmYLWX3Q6F9RSRi4BxGdrCGNmwfuGXTtlQucFE4AbvcaWEjZNjdGJdfbYfPj/j2SOHRlTsnGdZMjQeT/pmO4mUrVM686/7FotxKbFt2SivUiwAzPT0Dip8v3UVCp8pgm27HpkGy1stgNEXv91ys0dZSE/WCxvwPR+LfyKlajk+o4P18Yk8/71ktAeRXa40Y1gs2ypxwoCq7Q/1XWdds8LPjbfwHee+Yxx6bXj/SsY9B6QQacHcl/VL4UpVT0fqv0Jff6fGXXFWqMGyRkOL4DmmJ8k2B1qa9bgjkEt4ky/I6RaN14JW+phq2/nCVD93TRPO03ypNZc454kNEzoeAbj1dm6IOugIalRZZaDSVc/Bt7LI6EFl0i0rLbnf3mqV8dqaOYfg mZyrTQWx 4Eum/AWX/iEMUJJHkIsobJL6UZTBot5rE1fuzEBTv2ZMj5svNDsF4G5HnMTQ/QIRiNLnovsTcpGdieubCkig7PTbrvJJ8k7hwWY3VgoGlcpmSV5EmKXAP7U0Iz1Y9i9KSa0XeIgQBoDNH1h1QHDAYL2q6cdu3SL/+EBX4VKU27/Hp4VCpC+zVx4/V2zfKL3zk3HsvJS61r5NVuqk24zraVy6f4F8cZ7qkzvg28IUtrLv5JPohO8Y3QApY1ROCFb7sXsCHH3YpNLPcTAnH++B2fTJ3hGPGxRjeGcgB0ehZ1vF6xaRsI/1uaDWcIcqaNl0XBL3TeFlEFIpdy1x/5lpo+Y8IAVtHKJx7g0A2mBsukP0nKzUzJrMRPzUjiV41b32ESyS9EwajHrkh5e4dsTHZsPKE1TiN9ziPAdqmFbbjI8LIUsbqp8fQPOGoQC8y5VinrBPW80matYihCa0yCt4rJr7qrZIK8P66wtPPYrkZJX5SwLd+TloPcrDvIC/Gkfbcq7/MYRNNYcuCq/C0qkhuEa69IQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 在 2023/9/15 18:51, Peng Zhang 写道: > > > 在 2023/9/8 04:14, Liam R. Howlett 写道: >> * Peng Zhang [230830 08:58]: >>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then >>> directly modify the entries of VMAs in the new maple tree, which can >>> get better performance. The optimization effect is proportional to the >>> number of VMAs. >>> >>> There is a "spawn" in byte-unixbench[1], which can be used to test the >>> performance of fork(). I modified it slightly to make it work with >>> different number of VMAs. >>> >>> Below are the test numbers. There are 21 VMAs by default. The first row >>> indicates the number of added VMAs. The following two lines are the >>> number of fork() calls every 10 seconds. These numbers are different >>> from the test results in v1 because this time the benchmark is bound to >>> a CPU. This way the numbers are more stable. 
>>
>> Thanks!
>>
>> Can you include 21 in this table since it's the default?
>>
>>>
>>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>>
>>> Signed-off-by: Peng Zhang
>>> ---
>>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>>   mm/mmap.c     | 14 ++++++++++++--
>>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 3b6d20dfb9a8..e6299adefbd8 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>       int retval;
>>>       unsigned long charge = 0;
>>>       LIST_HEAD(uf);
>>> -    VMA_ITERATOR(old_vmi, oldmm, 0);
>>>       VMA_ITERATOR(vmi, mm, 0);
>>>       uprobe_start_dup_mmap();
>>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>           goto out;
>>>       khugepaged_fork(mm, oldmm);
>>> -    retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>>> -    if (retval)
>>> +    /* Use __mt_dup() to efficiently build an identical maple tree. */
>>> +    retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
>>
>> Apparently the flags should be GFP_KERNEL here so that compaction can
>> run.
>>
>>> +    if (unlikely(retval))
>>>           goto out;
>>>       mt_clear_in_rcu(vmi.mas.tree);
>>> -    for_each_vma(old_vmi, mpnt) {
>>> +    for_each_vma(vmi, mpnt) {
>>>           struct file *file;
>>>           vma_start_write(mpnt);
>>>           if (mpnt->vm_flags & VM_DONTCOPY) {
>>>               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>>> +
>>> +            /*
>>> +             * Since the new tree is exactly the same as the old one,
>>> +             * we need to remove the unneeded VMAs.
>>> +             */
>>> +            mas_store(&vmi.mas, NULL);
>>> +
>>> +            /*
>>> +             * Even removing an entry may require memory allocation,
>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark
>>> +             * from which VMA it failed. The case of encountering
>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap().
>>> +             */
>>> +            if (unlikely(mas_is_err(&vmi.mas))) {
>>> +                retval = xa_err(vmi.mas.node);
>>> +                mas_reset(&vmi.mas);
>>> +                if (mas_find(&vmi.mas, ULONG_MAX))
>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
>>> +                goto loop_out;
>>> +            }
>>> +
>>
>> Storing NULL may need extra space as you noted, so we need to be careful
>> what happens if we don't have that space.  We should have a testcase to
>> test this scenario.
>>
>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
>> in this function, see vm_area_dup().
>>
>> Don't use the exit_mmap() path to undo a failed fork.  You've added
>> checks and complications to the exit path for all tasks in the very
>> unlikely event that we run out of memory when we hit a very unlikely
>> VM_DONTCOPY flag.
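
For reference, the two allocation-flag comments above point at roughly
the following shape for the duplication and the NULL store. This is an
untested, abbreviated sketch only, not the final patch; it reuses the
existing loop_out label, and the unwinding on failure is the part
discussed next.

	/* GFP_KERNEL so reclaim/compaction can run during the dup. */
	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
	if (unlikely(retval))
		goto out;
	...
		if (mpnt->vm_flags & VM_DONTCOPY) {
			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
			/*
			 * Erasing the duplicated entry may itself allocate;
			 * mas_store_gfp() returns the error directly instead
			 * of leaving it encoded in the maple state.
			 */
			retval = mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
			if (unlikely(retval))
				goto loop_out;	/* unwind as sketched below */
			continue;
		}
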
>>
>> I see the issue with having a portion of the tree with new VMAs that are
>> accounted and a portion of the tree that has old VMAs that should not be
>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
>> we cannot add that complication to the exit path and then there is the
>> OOM race to worry about (maybe, I am not sure since this MM isn't
>> active yet).
> I encountered some errors after implementing the scheme you mentioned
> below. This would also clutter fork.c and mmap.c, as some internal
> functions would need to be made global.
>
> I thought of another way to put everything into the maple tree. In
> non-RCU mode, we can remove the last half of the tree without
> allocating any memory. This requires modifications to the internal
> implementation of mas_store().
> Then remove the second half of the tree like this:
>
> mas.index = 0;

Sorry, typo. Change to:

mas.index = vma->vm_start;

> mas.last = ULONG_MAX;
> mas_store(&mas, NULL);
>
> At least in non-RCU mode, we can do this, since we only need to merge
> some nodes, or move some items to adjacent nodes.
> However, this will increase the workload significantly.
>
>>
>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
>> prototype, we can do something like the *untested* code below:
>>
>> if (unlikely(mas_is_err(&vmi.mas))) {
>>     unsigned long max = vmi.index;
>>
>>     retval = xa_err(vmi.mas.node);
>>     mas_set(&vmi.mas, 0);
>>     tmp = mas_find(&vmi.mas, ULONG_MAX);
>>     if (tmp) { /* Not the first VMA failed */
>>         unsigned long nr_accounted = 0;
>>
>>         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
>>                 true);
>>         do {
>>             if (vma->vm_flags & VM_ACCOUNT)
>>                 nr_accounted += vma_pages(vma);
>>             remove_vma(vma, true);
>>             cond_resched();
>>             vma = mas_find(&vmi.mas, max - 1);
>>         } while (vma != NULL);
>>
>>         vm_unacct_memory(nr_accounted);
>>     }
>>     __mt_destroy(&mm->mm_mt);
>>     goto loop_out;
>> }
>>
>> Once exit_mmap() is called, the check for OOM (no vma) will catch that
>> nothing is left to do.
>>
>> It might be worth making an inline function to do this to keep the fork
>> code clean.  We should test this by detecting a specific task name and
>> returning a failure at a given interval:
>>
>> if (!strcmp(current->comm, "fork_test")) {
>> ...
>> }
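
Spelling that test idea out a little, a fault-injection helper might
look roughly like the sketch below. It is illustrative only: the helper
name, the "fork_test" task name and the 1-in-100 interval are arbitrary
choices, and the call would sit in the error-checked store path above.

#include <linux/atomic.h>
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/types.h>

/* Illustrative only: fail occasionally for a task named "fork_test". */
static atomic_t fork_test_fail_count = ATOMIC_INIT(0);

static inline bool fork_test_should_fail(void)
{
	if (strcmp(current->comm, "fork_test"))
		return false;

	/* Fail roughly one call in 100 to exercise the unwind path. */
	return (atomic_inc_return(&fork_test_fail_count) % 100) == 0;
}

While testing, the store in dup_mmap() could then be wrapped along the
lines of "retval = fork_test_should_fail() ? -ENOMEM :
mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);".
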
>>
>>
>>>               continue;
>>>           }
>>>           charge = 0;
>>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>               hugetlb_dup_vma_private(tmp);
>>>           /* Link the vma into the MT */
>>> -        if (vma_iter_bulk_store(&vmi, tmp))
>>> -            goto fail_nomem_vmi_store;
>>> +        mas_store(&vmi.mas, tmp);
>>>           mm->map_count++;
>>>           if (!(tmp->vm_flags & VM_WIPEONFORK))
>>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>>       uprobe_end_dup_mmap();
>>>       return retval;
>>> -fail_nomem_vmi_store:
>>> -    unlink_anon_vmas(tmp);
>>>   fail_nomem_anon_vma_fork:
>>>       mpol_put(vma_policy(tmp));
>>>   fail_nomem_policy:
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index b56a7f0c9f85..dfc6881be81c 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>>       arch_exit_mmap(mm);
>>>       vma = mas_find(&mas, ULONG_MAX);
>>> -    if (!vma) {
>>> +    /*
>>> +     * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>>> +     * xa_is_zero(vma) may be true.
>>> +     */
>>> +    if (!vma || xa_is_zero(vma)) {
>>>           /* Can happen if dup_mmap() received an OOM */
>>>           mmap_read_unlock(mm);
>>>           return;
>>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>>           remove_vma(vma, true);
>>>           count++;
>>>           cond_resched();
>>> -    } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>>> +        vma = mas_find(&mas, ULONG_MAX);
>>> +        /*
>>> +         * If xa_is_zero(vma) is true, it means that subsequent VMAs
>>> +         * do not need to be removed. Can happen if dup_mmap() fails to
>>> +         * remove a VMA marked VM_DONTCOPY.
>>> +         */
>>> +    } while (vma != NULL && !xa_is_zero(vma));
>>>       BUG_ON(count != mm->map_count);
>>> --
>>> 2.20.1
>>>
>>