Date: Mon, 31 Jul 2023 20:59:33 +0800
Subject: Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
To: "Liam R. Howlett"
References: <20230726080916.17454-1-zhangpeng.00@bytedance.com>
 <20230726080916.17454-12-zhangpeng.00@bytedance.com>
 <20230726170645.2m2rbk325dy727eo@revolver>
Cc: linux-mm@kvack.org, avagin@gmail.com, npiggin@gmail.com,
 mathieu.desnoyers@efficios.com, peterz@infradead.org,
 michael.christie@oracle.com, surenb@google.com, brauner@kernel.org,
 willy@infradead.org, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, Peng Zhang, corbet@lwn.net,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
From: Peng Zhang
In-Reply-To: <20230726170645.2m2rbk325dy727eo@revolver>

On 2023/7/27 01:06, Liam R. Howlett wrote:
> * Peng Zhang [230726 04:10]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. dup_mmap() is used by fork(), so this patch
>> optimizes fork(). The optimization effect is proportional to the number
>> of VMAs.
>>
>> Due to the introduction of this method, the optimization in
>> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
>> has an effect here, but it is also an optimization of the maple tree.
>>
>> There is a unixbench test suite[2] where 'spawn' is used to test fork().
>> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
>> bit to use mmap() to control the number of VMAs. Therefore, the
>> performance under different numbers of VMAs can be measured.
>>
>> Insert code like below into 'spawn':
>> for (int i = 0; i < 200; ++i) {
>> 	size_t size = 10 * getpagesize();
>> 	void *addr;
>>
>> 	if (i & 1) {
>> 		addr = mmap(NULL, size, PROT_READ,
>> 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	} else {
>> 		addr = mmap(NULL, size, PROT_WRITE,
>> 			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> 	}
>> 	if (addr == MAP_FAILED)
>> 		...
>> }
>>
>> Based on next-20230721, use 'spawn' under 23, 223, and 4023 VMAs, test
>> 4 times in 30 seconds each time, and get the following numbers. These
>> numbers are the number of fork() successes in 30s (average of the best
>> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
>> tested it together as a comparison. In order to ensure the reliability
>> of the test results, these tests were run on a physical machine.
>>
>>                 23VMAs      223VMAs     4023VMAs
>> revert [1]:     159104.00   73316.33    6787.00
>
> You can probably remove the revert benchmark from this since there is no
> reason to revert the previous change. The change is worthwhile on its
> own, so it's better to have the numbers more clear by having with and
> without this series.

I will remove it.

>
>>
>>                 +0.77%      +0.42%      +0.28%
>> next-20230721:  160321.67   73624.67    6806.33
>>
>>                 +2.77%      +15.42%     +29.86%
>> apply this:     164751.67   84980.33    8838.67
>
> What is the difference between using this patch with mas_replace_entry()
> and mas_store_entry()?

I haven't tested and compared them yet; I will compare them when I have
time. They could be compared by simulating fork() in user space.

>
>>
>> It can be seen that the performance improvement is proportional to
>> the number of VMAs. With 23 VMAs, performance improves by about 3%,
>> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
>> performance improves by about 30%.
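
For anyone re-checking the percentages quoted above, here is a minimal
sketch of the arithmetic. The helper name improvement_pct() is made up
for this note and is not part of the patch or of unixbench. Fed with the
rounded averages from the table, it gives roughly 2.76% for 23 VMAs,
15.42% for 223 VMAs and 29.86% for 4023 VMAs, so the gain grows with the
VMA count, though not in strict proportion.

```c
#include <assert.h>

/*
 * Relative change of `patched` over `baseline`, in percent.
 * Hypothetical helper for illustration only; inputs are fork()
 * counts per 30s as reported in the table above.
 */
static double improvement_pct(double baseline, double patched)
{
	assert(baseline > 0.0);	/* a zero baseline has no meaningful ratio */
	return (patched - baseline) / baseline * 100.0;
}
```

Calling improvement_pct(73624.67, 84980.33), for example, reproduces
the 223-VMA column.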
>>
>> [1] https://lore.kernel.org/lkml/20230628073657.75314-4-zhangpeng.00@bytedance.com/
>> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang
>> ---
>>  kernel/fork.c | 35 +++++++++++++++++++++++++++--------
>>  mm/mmap.c     | 14 ++++++++++++--
>>  2 files changed, 39 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f81149739eb9..ef80025b62d6 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  	int retval;
>>  	unsigned long charge = 0;
>>  	LIST_HEAD(uf);
>> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>>  	VMA_ITERATOR(vmi, mm, 0);
>>
>>  	uprobe_start_dup_mmap();
>> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  		goto out;
>>  	khugepaged_fork(mm, oldmm);
>>
>> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> -	if (retval)
>> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
>> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
>> +	if (unlikely(retval))
>>  		goto out;
>>
>>  	mt_clear_in_rcu(vmi.mas.tree);
>> -	for_each_vma(old_vmi, mpnt) {
>> +	for_each_vma(vmi, mpnt) {
>>  		struct file *file;
>>
>>  		vma_start_write(mpnt);
>>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>>  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> +			/*
>> +			 * Since the new tree is exactly the same as the old one,
>> +			 * we need to remove the unneeded VMAs.
>> +			 */
>> +			mas_store(&vmi.mas, NULL);
>> +
>> +			/*
>> +			 * Even removing an entry may require memory allocation,
>> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
>> +			 * from which VMA it failed. The case of encountering
>> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
>> +			 */
>> +			if (unlikely(mas_is_err(&vmi.mas))) {
>> +				retval = xa_err(vmi.mas.node);
>> +				mas_reset(&vmi.mas);
>> +				if (mas_find(&vmi.mas, ULONG_MAX))
>> +					mas_replace_entry(&vmi.mas,
>> +							  XA_ZERO_ENTRY);
>> +				goto loop_out;
>> +			}
>> +
>>  			continue;
>>  		}
>>  		charge = 0;
>> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  			hugetlb_dup_vma_private(tmp);
>>
>>  		/* Link the vma into the MT */
>> -		if (vma_iter_bulk_store(&vmi, tmp))
>> -			goto fail_nomem_vmi_store;
>> +		mas_replace_entry(&vmi.mas, tmp);
>>
>>  		mm->map_count++;
>>  		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  	uprobe_end_dup_mmap();
>>  	return retval;
>>
>> -fail_nomem_vmi_store:
>> -	unlink_anon_vmas(tmp);
>>  fail_nomem_anon_vma_fork:
>>  	mpol_put(vma_policy(tmp));
>>  fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index bc91d91261ab..5bfba2fb0e39 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
>>  	arch_exit_mmap(mm);
>>
>>  	vma = mas_find(&mas, ULONG_MAX);
>> -	if (!vma) {
>> +	/*
>> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> +	 * xa_is_zero(vma) may be true.
>> +	 */
>> +	if (!vma || xa_is_zero(vma)) {
>>  		/* Can happen if dup_mmap() received an OOM */
>>  		mmap_read_unlock(mm);
>>  		return;
>> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
>>  		remove_vma(vma, true);
>>  		count++;
>>  		cond_resched();
>> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> +		vma = mas_find(&mas, ULONG_MAX);
>> +		/*
>> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> +		 * do not need to be removed. Can happen if dup_mmap() fails to
>> +		 * remove a VMA marked VM_DONTCOPY.
>> +		 */
>> +	} while (vma != NULL && !xa_is_zero(vma));
>>
>>  	BUG_ON(count != mm->map_count);
>>
>> --
>> 2.20.1
>>
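
To make the failure handling in the quoted diff easier to follow, here
is a toy user-space model of the idea. It is only a sketch under loud
assumptions: a flat array of pointers stands in for the maple tree, and
every name in it (toy_vma, toy_zero_entry, toy_dup, toy_teardown, the
fail_remove test hook) is invented for illustration and is not a kernel
API. It shows the three steps: copy the parent's slots wholesale,
replace each slot in place with the child's own object, and on a failed
removal drop a sentinel so that teardown stops before touching entries
that still belong to the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VMA; not the kernel's struct vm_area_struct. */
struct toy_vma {
	bool dontcopy;		/* models VM_DONTCOPY */
	bool fail_remove;	/* test hook: pretend removing this slot fails */
};

/* Sentinel playing the role XA_ZERO_ENTRY plays in the patch. */
static struct toy_vma toy_zero_entry;

/*
 * Duplicate parent[] into child[], then fix up child[] in place.
 * copies[] provides storage for the child's own objects.
 * Returns 0 on success, -1 if a removal "failed".
 */
static int toy_dup(struct toy_vma *const *parent, struct toy_vma **child,
		   struct toy_vma *copies, size_t n, size_t *map_count)
{
	*map_count = 0;

	/* Step 1: identical copy; slots still point at the parent's objects. */
	for (size_t i = 0; i < n; i++)
		child[i] = parent[i];

	/* Step 2: walk the copy and replace each slot in place. */
	for (size_t i = 0; i < n; i++) {
		struct toy_vma *src = child[i];

		if (src->dontcopy) {
			if (src->fail_remove) {
				/* Mark where we stopped; later slots are untouched. */
				child[i] = &toy_zero_entry;
				return -1;
			}
			child[i] = NULL;	/* removed from the child */
			continue;
		}
		copies[i] = *src;
		child[i] = &copies[i];
		(*map_count)++;
	}
	return 0;
}

/*
 * Teardown in the spirit of the exit_mmap() loop: release the child's
 * entries and stop at the sentinel. Returns how many were released.
 */
static size_t toy_teardown(struct toy_vma **child, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (child[i] == NULL)
			continue;		/* empty slot, nothing to release */
		if (child[i] == &toy_zero_entry)
			break;			/* the rest belongs to the parent */
		child[i] = NULL;
		count++;
	}
	return count;
}
```

In this model the count returned by toy_teardown() matches map_count
even after a failed removal, which is the condition the
BUG_ON(count != mm->map_count) check in exit_mmap() relies on.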