From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 13B38C54EE9 for ; Tue, 27 Sep 2022 19:23:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 7BD858E00FB; Tue, 27 Sep 2022 15:23:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 76BD38E00C1; Tue, 27 Sep 2022 15:23:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 60CCA8E00FB; Tue, 27 Sep 2022 15:23:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 4E3A88E00C1 for ; Tue, 27 Sep 2022 15:23:22 -0400 (EDT) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 14C151C66FA for ; Tue, 27 Sep 2022 19:23:22 +0000 (UTC) X-FDA: 79958839044.26.2B7A661 Received: from mail-pl1-f181.google.com (mail-pl1-f181.google.com [209.85.214.181]) by imf08.hostedemail.com (Postfix) with ESMTP id 9E984160014 for ; Tue, 27 Sep 2022 19:23:21 +0000 (UTC) Received: by mail-pl1-f181.google.com with SMTP id t3so9973554ply.2 for ; Tue, 27 Sep 2022 12:23:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date; bh=irMbtd8FZdiIboGibKNEwpTqt4Jj4yNLbVUI1gMwaPk=; b=e0FQ6dEVHGBnnI6HqwHOLFzljuzwHYhz+8RFJ0db4risESGFamZ+UkWBPp8hD8O4qr 4VyJSyWkkMh8iRaCxLOZFA2R3V/It7lko9Ic4CYQYVi6wHUO7A/zCpLRexsMrFULSCH4 L5cbdNFEbn0q08Zv5vq4ZRaNt90B4qavACQ1ZEfTAbMExzqvrr6rSJH7AKHC6r5R+Tyy 77E9rONRP2Lk3ZVQrDVzqJgInQVokonQBKLs2S0uum07//G6kgoCBoSoe9R4QFxmI89K vSQcEtob0Rgd9W21dmpmg+3xKF56nL/WTSwkOajK+Edlm+bkLNz/y5P/1M2xjSHHcI1z 2ogQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date; bh=irMbtd8FZdiIboGibKNEwpTqt4Jj4yNLbVUI1gMwaPk=; b=AW5II1R6dsczBTrsxHqdRpTU7+KkdUKvWy5XOndy8V/9f60Utyxh/1zMp+h1BvUtUB s2zbRyTf+6hrVEVCOAKDcj+6xM9Sn3qCZjubRDnRMorpGsmS9S9qxB5gDEtZs/OfJGKO 86AgL8Y8PR45Kx/ty6gD1lqp2ogFbCv3bIWKE9i1AAGPVwtZ6wmI1XaXdaMfgP2f/p01 lcEQw5pkE5Wke1GiwWTbLmvKOL6Jn92qI0o1NvkswmaLBryGm+rnyRu89zBsg21ADcO7 5sa7afQ0Cg+LQ1vgI8UNtPYwlLh6jh8tclJfl3wkQBwJuY+5prACmXtgwhkWVs26oNx5 8fQA== X-Gm-Message-State: ACrzQf1Wn1bb5WfcGp8qT68Bxz3eKGwUPoYOHD+ijQxafi3YLa9i0ip1 +dF4WuH3bFwyf/i/4g6UIak= X-Google-Smtp-Source: AMsMyM4skroVhGNnqPKoINwdTfOGZ9HFN/FSTUS/QRpvxcQlBjxJf3Qj5fSFlWtyuniJaTLMZgkAcA== X-Received: by 2002:a17:903:509:b0:179:ffcf:d275 with SMTP id jn9-20020a170903050900b00179ffcfd275mr333187plb.150.1664306600434; Tue, 27 Sep 2022 12:23:20 -0700 (PDT) Received: from strix-laptop (2001-b011-20e0-1b32-d148-19d6-82fa-6094.dynamic-ip6.hinet.net. [2001:b011:20e0:1b32:d148:19d6:82fa:6094]) by smtp.gmail.com with ESMTPSA id p3-20020a1709026b8300b0016d773aae60sm1949566plk.19.2022.09.27.12.23.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 27 Sep 2022 12:23:19 -0700 (PDT) Date: Wed, 28 Sep 2022 03:23:13 +0800 From: Chih-En Lin To: Nadav Amit Cc: Andrew Morton , Qi Zheng , David Hildenbrand , Matthew Wilcox , Christophe Leroy , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Luis Chamberlain , Kees Cook , Iurii Zaikin , Vlastimil Babka , William Kucharski , "Kirill A . Shutemov" , Peter Xu , Suren Baghdasaryan , Arnd Bergmann , Tong Tiangen , Pasha Tatashin , Li kunyu , Anshuman Khandual , Minchan Kim , Yang Shi , Song Liu , Miaohe Lin , Thomas Gleixner , Sebastian Andrzej Siewior , Andy Lutomirski , Fenghua Yu , Dinglan Peng , Pedro Fonseca , Jim Huang , Huichun Feng Subject: Re: [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Message-ID: References: <20220927162957.270460-1-shiyn.lin@gmail.com> <20220927162957.270460-8-shiyn.lin@gmail.com> <8F98262B-206B-434C-88B9-9F3A6919782D@vmware.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <8F98262B-206B-434C-88B9-9F3A6919782D@vmware.com> ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1664306601; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=irMbtd8FZdiIboGibKNEwpTqt4Jj4yNLbVUI1gMwaPk=; b=HZMSv8Le5aDZq7+q/OjYTIqrjfovC8wZdnnDAOZ8yxCMPddZtVigk5EwpMmQqEczzR2Huu NdhxJc05Qb7CfTwgico6nlCGzh2V50HGEl53cFjVkFbr3e3Ydade9RgrYzoKNxxet5rT+c YVqhkG2CGMygHWCwDew/IlNzMo6fia8= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=e0FQ6dEV; spf=pass (imf08.hostedemail.com: domain of shiyn.lin@gmail.com designates 209.85.214.181 as permitted sender) smtp.mailfrom=shiyn.lin@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1664306601; a=rsa-sha256; cv=none; b=RhksJhxAkTVSy3Yz4whigR0e69Tdd9kgv1r4CM2iy7hCVy4elKOsBXxB2UyBdOAacX2t1c RNi7RrtF2LfPMIyUV6atQeCTvcloDX3DuQXW6uMCa6kSjMDmCWjELHeczRXS2jh4kD/BKn dke9eAHBlvVtJegsd4TziKNXCMIixhM= X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 9E984160014 X-Rspam-User: Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=e0FQ6dEV; spf=pass (imf08.hostedemail.com: domain of shiyn.lin@gmail.com designates 209.85.214.181 as permitted sender) smtp.mailfrom=shiyn.lin@gmail.com; dmarc=pass (policy=none) header.from=gmail.com X-Stat-Signature: 95ih8iaa5yeqywmo4onra5m5akrhqc7y X-HE-Tag: 1664306601-762296 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Sep 27, 2022 at 06:15:34PM +0000, Nadav Amit wrote: > On Sep 27, 2022, at 9:29 AM, Chih-En Lin wrote: > > > To handle the COW PTE with write fault, introduce the helper function > > handle_cow_pte(). The function provides two behaviors. One is breaking > > COW by decreasing the refcount, pgables_bytes, and RSS. Another is > > copying all the information in the shared PTE table by using > > copy_pte_page() with a wrapper. > > > > Also, add the wrapper functions to help us find out the COWed or > > COW-available PTE table. > > > > [ snip ] > > > +static inline int copy_cow_pte_range(struct vm_area_struct *vma, > > + pmd_t *dst_pmd, pmd_t *src_pmd, > > + unsigned long start, unsigned long end) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + struct mmu_notifier_range range; > > + int ret; > > + bool is_cow; > > + > > + is_cow = is_cow_mapping(vma->vm_flags); > > + if (is_cow) { > > + mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE, > > + 0, vma, mm, start, end); > > + mmu_notifier_invalidate_range_start(&range); > > + mmap_assert_write_locked(mm); > > + raw_write_seqcount_begin(&mm->write_protect_seq); > > + } > > + > > + ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end); > > + > > + if (is_cow) { > > + raw_write_seqcount_end(&mm->write_protect_seq); > > + mmu_notifier_invalidate_range_end(&range); > > Usually, I would expect mmu-notifiers and TLB flushes to be initiated at the > same point in the code. Presumably you changed protection, so you do need a > TLB flush, right? Is it done elsewhere? You're right. I will add TLB flushes here. Thanks. > > + } > > + > > + return ret; > > +} > > + > > +/* > > + * Break COW PTE, two state here: > > + * - After fork : [parent, rss=1, ref=2, write=NO , owner=parent] > > + * to [parent, rss=1, ref=1, write=YES, owner=NULL ] > > + * COW PTE become [ref=1, write=NO , owner=NULL ] > > + * [child , rss=0, ref=2, write=NO , owner=parent] > > + * to [child , rss=1, ref=1, write=YES, owner=NULL ] > > + * COW PTE become [ref=1, write=NO , owner=parent] > > + * NOTE > > + * - Copy the COW PTE to new PTE. > > + * - Clear the owner of COW PTE and set PMD entry writable when it is owner. > > + * - Increase RSS if it is not owner. > > + */ > > +static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, > > + unsigned long addr) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + unsigned long pte_start, pte_end; > > + unsigned long start, end; > > + struct vm_area_struct *prev = vma->vm_prev; > > + struct vm_area_struct *next = vma->vm_next; > > + pmd_t cowed_entry = *pmd; > > + > > + if (cow_pte_count(&cowed_entry) == 1) { > > + cow_pte_fallback(vma, pmd, addr); > > + return 1; > > + } > > + > > + pte_start = start = addr & PMD_MASK; > > + pte_end = end = (addr + PMD_SIZE) & PMD_MASK; > > + > > + pmd_clear(pmd); > > + /* > > + * If the vma does not cover the entire address range of the PTE table, > > + * it should check the previous and next. > > + */ > > + if (start < vma->vm_start && prev) { > > + /* The part of address range is covered by previous. */ > > + if (start < prev->vm_end) > > + copy_cow_pte_range(prev, pmd, &cowed_entry, > > + start, prev->vm_end); > > + start = vma->vm_start; > > + } > > + if (end > vma->vm_end && next) { > > + /* The part of address range is covered by next. */ > > + if (end > next->vm_start) > > + copy_cow_pte_range(next, pmd, &cowed_entry, > > + next->vm_start, end); > > + end = vma->vm_end; > > + } > > + if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end)) > > + return -ENOMEM; > > + > > + /* > > + * Here, it is the owner, so clear the ownership. To keep RSS state and > > + * page table bytes correct, it needs to decrease them. > > + * Also, handle the address range issue here. > > + */ > > + if (cow_pte_owner_is_same(&cowed_entry, pmd)) { > > + set_cow_pte_owner(&cowed_entry, NULL); > > Presumably there is some assumption on atomicity here. Otherwise, two > threads can run the following code, which is wrong, no? Yet, I do not see > anything that provides such atomicity. I may have multiple process access here. But for the thread, I assume that they need to hold the mmap_lock. Maybe I need to add the assert here too. > > > + if (pte_start < vma->vm_start && prev && > > + pte_start < prev->vm_end) > > + cow_pte_rss(mm, vma->vm_prev, pmd, > > + pte_start, prev->vm_end, false /* dec */); > > + if (pte_end > vma->vm_end && next && > > + pte_end > next->vm_start) > > + cow_pte_rss(mm, vma->vm_next, pmd, > > + next->vm_start, pte_end, false /* dec */); > > + cow_pte_rss(mm, vma, pmd, start, end, false /* dec */); > > + mm_dec_nr_ptes(mm); > > + } > > + > > + /* Already handled it, don't reuse cowed table. */ > > + pmd_put_pte(vma, &cowed_entry, addr, false); > > + > > + VM_BUG_ON(cow_pte_count(pmd) != 1); > > Don’t use VM_BUG_ON(). Sure. I will change it to VM_WARN_ON(). Thanks, Chih-En Lin