From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2DC77C46CA1 for ; Mon, 18 Sep 2023 13:14:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9BFB46B032F; Mon, 18 Sep 2023 09:14:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 96EBD6B0331; Mon, 18 Sep 2023 09:14:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 837416B0336; Mon, 18 Sep 2023 09:14:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 76ADF6B032F for ; Mon, 18 Sep 2023 09:14:55 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 499F814027A for ; Mon, 18 Sep 2023 13:14:55 +0000 (UTC) X-FDA: 81249763350.17.D5FAC84 Received: from mail-pl1-f171.google.com (mail-pl1-f171.google.com [209.85.214.171]) by imf10.hostedemail.com (Postfix) with ESMTP id 767A4C0027 for ; Mon, 18 Sep 2023 13:14:51 +0000 (UTC) Authentication-Results: imf10.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=hZ3mKiax; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf10.hostedemail.com: domain of zhangpeng.00@bytedance.com designates 209.85.214.171 as permitted sender) smtp.mailfrom=zhangpeng.00@bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695042892; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=QpekWC9RDoPaptS5D3LgfCtgsKJPqxE0CvywzGx4FOY=; b=tR9fJzqvNyC5SFm7b2m5zR90InxtAesneMQV3B34KyY2kPftHD7aKBTtsvbZPliOZDL8pL jZEqoLqce/klbcwHh24hIwbE043ymAlr+4JKTuXxz7XPPJTFVODHKmj9DuSTI6ENs8Knm7 2qsT0AznV6aiUefo2hwtSpGrDi421VQ= ARC-Authentication-Results: i=1; imf10.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=hZ3mKiax; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf10.hostedemail.com: domain of zhangpeng.00@bytedance.com designates 209.85.214.171 as permitted sender) smtp.mailfrom=zhangpeng.00@bytedance.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695042892; a=rsa-sha256; cv=none; b=RIEYAUZ3pWsnFq9n2E/PSeHb+ewIPqFgUcNK+7m7NktfUg/TqhoSdTLD8U/2QnEbu9bGVx P6BxIVM78GaRO7IH0bvG09AftHCaRb3xRn3BJXvCjGGuiDLlZpdOmy4VEJHa81sNYk1XQa gd24kYmGJ2DbqadhGuVUlbnOFNigx0Q= Received: by mail-pl1-f171.google.com with SMTP id d9443c01a7336-1c451541f23so9965415ad.2 for ; Mon, 18 Sep 2023 06:14:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1695042890; x=1695647690; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:to:subject :user-agent:mime-version:date:message-id:from:to:cc:subject:date :message-id:reply-to; bh=QpekWC9RDoPaptS5D3LgfCtgsKJPqxE0CvywzGx4FOY=; b=hZ3mKiax2HIrNX1JJ4raL5rhBc2xBRvD00vURkOtpvit3AhYTGTspqSzlY0i5/dKft xv70ioWOqdzzcYUYeXf7/gn+xBhJxGqttSB66k+DmyWb47KQA0HHBfVSMHjeVjWKC8vM V5GFxGZoBdwEsk+Yu61sb+Ojz5XbPrM0ojPeMt7pLfEWJHMh3owSjt50IGhFh3Rmh48l 4fOhdY/lQJQU5teect3xmNgatTaXuMYC4KMnCRy75NPUljBzk3LNPY35sBl++vGvKTUY 3yAhHmlQ5Vw6nqg3BJs1/g45xUkP66BSzP7nbCbxN0sOYwZbgbWXQ2KruDUcx3zlgr4m mDHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695042890; x=1695647690; h=content-transfer-encoding:in-reply-to:from:references:to:subject :user-agent:mime-version:date:message-id:x-gm-message-state:from:to :cc:subject:date:message-id:reply-to; bh=QpekWC9RDoPaptS5D3LgfCtgsKJPqxE0CvywzGx4FOY=; b=d9yGiukhXUsq1QwaRDxUkF9/VKwmf4yt95Y5lE5X/iH8HjSah3lXGvsOJkXylMG4QR o3nHHT1rPNN2I+CIFXykpJqM6rTfLLSp6fIPnF1Nf0+9ud4Krg9HcHmA1azdBeSgf/tS qdSvNAS6ypfWGEgC08q4AjgNfXcHi89fKCT+qVLWgdp+d56Hmt+/YMNFAQYuqoLeBNIH bUje39e8cI59B8wr2u9pBnXIkXe4xw0PKRsATVScjAa6jdSUqWzX5J3Bbpg3q/OXQNkn g0EaynJu6c18yjsnr71KnUsvhPD4hSPbYzfSFpas/Ou5RwRusekaH4Z1DBK9m7X3WSWQ 3mEg== X-Gm-Message-State: AOJu0YwTUa2w2a9EQm8sZxjQbOyl0CK2A6G5hsqZ+OTQh7NfTLQ7vfI8 6IDBbLtjsvEX/NRkjpkue9QEGQ== X-Google-Smtp-Source: AGHT+IG19Ts0EgOuzI/WqXVCrvF4OpJ7JCaLRhVWQBgeNTxF4XOA9vmkEQ2/CAfIf6KkmuJkIwYS+g== X-Received: by 2002:a17:902:728a:b0:1c4:152a:496c with SMTP id d10-20020a170902728a00b001c4152a496cmr7514885pll.19.1695042889984; Mon, 18 Sep 2023 06:14:49 -0700 (PDT) Received: from [10.254.225.85] ([139.177.225.243]) by smtp.gmail.com with ESMTPSA id b7-20020a170902d50700b001a80ad9c599sm8276929plg.294.2023.09.18.06.14.44 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 18 Sep 2023 06:14:49 -0700 (PDT) Message-ID: <7f0db16b-609c-7504-bd5b-52a76e0200b3@bytedance.com> Date: Mon, 18 Sep 2023 21:14:42 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.15.1 Subject: Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() To: "Liam R. Howlett" , Peng Zhang , corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org, brauner@kernel.org, surenb@google.com, michael.christie@oracle.com, peterz@infradead.org, mathieu.desnoyers@efficios.com, npiggin@gmail.com, avagin@gmail.com, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org References: <20230830125654.21257-1-zhangpeng.00@bytedance.com> <20230830125654.21257-7-zhangpeng.00@bytedance.com> <20230907201414.dagnqxfnu7f7qzxd@revolver> <2868f2d5-1abe-7af6-4196-ee53cfae76a9@bytedance.com> <377b1e27-5752-ee04-b568-69b697852228@bytedance.com> <20230915200040.ebrri6zy3uccndx2@revolver> From: Peng Zhang In-Reply-To: <20230915200040.ebrri6zy3uccndx2@revolver> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 767A4C0027 X-Stat-Signature: su9op5gi4bu168k6kf9ska5f5cmmd9nw X-Rspam-User: X-HE-Tag: 1695042891-98812 X-HE-Meta: U2FsdGVkX1/rZnvzongJZX0lvR14hwWlriEPQwvXzHwa+gMH9Q/qnk4ue22AmGz+V+bfdeDfdmGA8ply7YRYSid5NUA1PfmsFdtsShWxd82O/SJHgJaJLgd82ulPiDdnQDM1DpDttXe55D90rW8ZAeHyTxyd4mQZbhtdzjN4EAUwXsJSpuMH64GcopKUM9/Hs6w6tEOGVbvjRcQNwxwW/HNeV3p7uz2RkaDUJKPpEYuErLP75pG2jXqrmYLEvu75AXc86FsvYlXomE0JDWiEkKtknthuSK+nOJXZMvLyMHDPAQepyncEVLruxYSOrr5+/klwOdfVE+wZLu9O+6LIGn8lPOEz0erkR5rq7qjA70WLv8qDLX17n7Cq9AeievRkm+JDCoOm7RALA5EHT7J76nhu+noRWfsp2mFqrGugT2I5Z7Th67q93YAeCio8IRk/v4PPMMNaCejKJ1TUWM+LUwYtG9ISZ8wVHaQXmSyvVZoq+O4PjpbEmMw3DORsvtlzSoW77qAnwSv/A/Mwi285AUxRxr8/MGEX2IDT/WhL5s2Yeyn+2NcXjQ33H1lOWWcWOb0Re46XagLlr4FSpdas/WiSzL7oHeqOM4NBzhlDpmnw3AvF96jAEvrEg9ICeyCMRBS85cCrYzKOnkSm7fNlCwVrAD7O0qCeyPCSsnYz7g8V6q6GNbEp3Rb+B9KEZ3/s1ozIi4IIRSoK6h96Yf1jS8tPfFyzNMg1egcQO1p1JVVdKxEV3PprO1IynMyeWsOI1H1COio5PkHiC7/6ewWmaYZguwS5FC40VGdtY9n8BTA89+SDBhC7lBxPYST4zBwQYuBfwPnkSn2MV6hyBzk6kzRABE4dcMFG3BFKzv/4wrQni+bEjrPaz0qkxLSo7siKlqsluG1yz0tclv3TuxjaBpqppCIFZyAvyAmIny1YnBBP42oiktDgAUG8qkrnjoQQUs3lTC38OnfS/A0xx+r 9zZ3DpJk IRAd51NOIGnUA55V/fvWh1WpVXX1Ojud4tDK0biQu9vGoT5N0ytdztbL+UBBmTS4cr4DF8ZEpGms0ur6MGwmTKbLA+ZFtDcnWzI+nk3A5JjStZWPrv8Mppano9HsjgoXQQClb1ATK+3sg0BgFUciyK3QX6u5TU/cLRpiOTzC3UHseW1HCK8UvbI+a6jnw64nTpD+298TgOTHyjJxdsrfldaN8pWmQf8EysUB326T0zpth5//uYqoFwp4puyLrIV8He+vNqrz5m8vmOCt73PbfTafhEZMm27Ftt+jX5AJLluVbdGyYA0ZpcXsDDDOHgmFDDEUKep4Z+Mzrgrvfwwl3P6xRP2CtERrgdtPXrjnkTpVQB7urrJoS5uVSYNDIHXOYKC0fklLMCllrDs85nuGm25Ruxl2IjBOUr4st2ZRJt/f7KhSQLbzurMiJZA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 在 2023/9/16 04:00, Liam R. Howlett 写道: > * Peng Zhang [230915 06:57]: >> >> > > ... > >>>>> +    if (unlikely(retval)) >>>>>           goto out; >>>>>       mt_clear_in_rcu(vmi.mas.tree); >>>>> -    for_each_vma(old_vmi, mpnt) { >>>>> +    for_each_vma(vmi, mpnt) { >>>>>           struct file *file; >>>>>           vma_start_write(mpnt); >>>>>           if (mpnt->vm_flags & VM_DONTCOPY) { >>>>>               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); >>>>> + >>>>> +            /* >>>>> +             * Since the new tree is exactly the same as the old one, >>>>> +             * we need to remove the unneeded VMAs. >>>>> +             */ >>>>> +            mas_store(&vmi.mas, NULL); >>>>> + >>>>> +            /* >>>>> +             * Even removing an entry may require memory allocation, >>>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark >>>>> +             * from which VMA it failed. The case of encountering >>>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap(). >>>>> +             */ >>>>> +            if (unlikely(mas_is_err(&vmi.mas))) { >>>>> +                retval = xa_err(vmi.mas.node); >>>>> +                mas_reset(&vmi.mas); >>>>> +                if (mas_find(&vmi.mas, ULONG_MAX)) >>>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY); >>>>> +                goto loop_out; >>>>> +            } >>>>> + >>>> >>>> Storing NULL may need extra space as you noted, so we need to be careful >>>> what happens if we don't have that space.  We should have a testcase to >>>> test this scenario. >>>> >>>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL >>>> in this function, see vm_area_dup(). >>>> >>>> Don't use the exit_mmap() path to undo a failed fork.  You've added >>>> checks and complications to the exit path for all tasks in the very >>>> unlikely event that we run out of memory when we hit a very unlikely >>>> VM_DONTCOPY flag. >>>> >>>> I see the issue with having a portion of the tree with new VMAs that are >>>> accounted and a portion of the tree that has old VMAs that should not be >>>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but >>>> we cannot add that complication to the exit path and then there is the >>>> OOM race to worry about (maybe, I am not sure since this MM isn't >>>> active yet). >>> I encountered some errors after implementing the scheme you mentioned >>> below. > > What were the errors? Maybe I missed something or there is another way. I found the cause of the problem and fixed it, tested the error path and it seems to be working fine now. The reason is that "free_pgd_range(tlb, addr, vma->vm_end,floor, next? next->vm_start: ceiling);" in free_pgtables() does not free all page tables due to the existence of the last false VMA. I've fixed it. Thanks. > >>> This would also clutter fork.c and mmap.c, as some internal >>> functions would need to be made global. > > Could it not be a new function in mm/mmap.c and added to mm/internal.h > that does the accounting and VMA freeing from [0 - vma->vm_start)? > > Maybe we could use it in the other areas that do this sort of work? > do_vmi_align_munmap() does something similar to what we need after the > "Point of no return". > >>> >>> I thought of another way to put everything into maple tree. In non-RCU >>> mode, we can remove the last half of the tree without allocating any >>> memory. This requires modifications to the internal implementation of >>> mas_store(). >>> Then remove the second half of the tree like this: >>> >>> mas.index = 0; >> Sorry, typo. >> Change to: mas.index = vma->start >>> mas.last = ULONGN_MAX; >>> mas_store(&mas, NULL). > > Well, we know we are not in RCU mode here, but I am concerned about this > going poorly. > >> >>> >>> At least in non-RCU mode, we can do this, since we only need to merge >>> some nodes, or move some items to adjacent nodes. >>> However, this will increase the workload significantly. > > In the unlikely event of an issue allocating memory, this would be > unwelcome. If we can avoid it, it would be best. I don't mind being > slow in error paths, but a significant workload would be rather bad on > an overloaded system. > >>> >>>> >>>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a >>>> prototype, we can do something like the *untested* code below: >>>> >>>> if (unlikely(mas_is_err(&vmi.mas))) { >>>>     unsigned long max = vmi.index; >>>> >>>>     retval = xa_err(vmi.mas.node); >>>>     mas_set(&vmi.mas, 0); >>>>     tmp = mas_find(&vmi.mas, ULONG_MAX); >>>>     if (tmp) { /* Not the first VMA failed */ >>>>         unsigned long nr_accounted = 0; >>>> >>>>         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max, >>>>                 true); >>>>         do { >>>>             if (vma->vm_flags & VM_ACCOUNT) >>>>                 nr_accounted += vma_pages(vma); >>>>             remove_vma(vma, true); >>>>             cond_resched(); >>>>             vma = mas_find(&vmi.mas, max - 1); >>>>         } while (vma != NULL); >>>> >>>>         vm_unacct_memory(nr_accounted); >>>>     } >>>>     __mt_destroy(&mm->mm_mt); >>>>     goto loop_out; >>>> } >>>> >>>> Once exit_mmap() is called, the check for OOM (no vma) will catch that >>>> nothing is left to do. >>>> >>>> It might be worth making an inline function to do this to keep the fork >>>> code clean.  We should test this by detecting a specific task name and >>>> returning a failure at a given interval: >>>> >>>> if (!strcmp(current->comm, "fork_test") { >>>> ... >>>> } >>>> > ... > > > Thanks, > Liam >