From: Barry Song <21cnbao@gmail.com>
Date: Mon, 27 Nov 2023 22:59:57 +1300
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
To: Ryan Roberts
Cc: david@redhat.com, akpm@linux-foundation.org, andreyknvl@gmail.com,
    anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com,
    dvyukov@google.com, glider@google.com, james.morse@arm.com,
    jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com,
    maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com,
    suzuki.poulose@arm.com, vincenzo.frascino@arm.com,
    wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org,
    yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <271f1e98-6217-4b40-bae0-0ac9fe5851cb@redhat.com>
    <20231127084217.13110-1-v-songbaohua@oppo.com>

On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts wrote:
>
> On 27/11/2023 08:42, Barry Song wrote:
> >>> +               for (i = 0; i < nr; i++, page++) {
> >>> +                       if (anon) {
> >>> +                               /*
> >>> +                                * If this page may have been pinned by the
> >>> +                                * parent process, copy the page immediately for
> >>> +                                * the child so that we'll always guarantee the
> >>> +                                * pinned page won't be randomly replaced in the
> >>> +                                * future.
> >>> +                                */
> >>> +                               if (unlikely(page_try_dup_anon_rmap(
> >>> +                                               page, false, src_vma))) {
> >>> +                                       if (i != 0)
> >>> +                                               break;
> >>> +                                       /* Page may be pinned, we have to copy. */
> >>> +                                       return copy_present_page(
> >>> +                                               dst_vma, src_vma, dst_pte,
> >>> +                                               src_pte, addr, rss, prealloc,
> >>> +                                               page);
> >>> +                               }
> >>> +                               rss[MM_ANONPAGES]++;
> >>> +                               VM_BUG_ON(PageAnonExclusive(page));
> >>> +                       } else {
> >>> +                               page_dup_file_rmap(page, false);
> >>> +                               rss[mm_counter_file(page)]++;
> >>> +                       }
> >>>                 }
> >>> -               rss[MM_ANONPAGES]++;
> >>> -       } else if (page) {
> >>> -               folio_get(folio);
> >>> -               page_dup_file_rmap(page, false);
> >>> -               rss[mm_counter_file(page)]++;
> >>> +
> >>> +               nr = i;
> >>> +               folio_ref_add(folio, nr);
> >>
> >> You're changing the order of mapcount vs. refcount increment. Don't.
> >> Make sure your refcount >= mapcount.
> >>
> >> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >> then decrementing in case of error accordingly. Errors due to pinned
> >> pages are the corner case.
> >>
> >> I'll note that it will make a lot of sense to have batch variants of
> >> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>
> >
> > i still don't understand why it is not a entire map+1, but an increment
> > in each basepage.
>
> Because we are PTE-mapping the folio, we have to account each individual page.
> If we accounted the entire folio, where would we unaccount it? Each page can be
> unmapped individually (e.g. munmap() part of the folio) so need to account each
> page. When PMD mapping, the whole thing is either mapped or unmapped, and its
> atomic, so we can account the entire thing.

Hi Ryan,

There is no problem. For example, a large folio is entirely mapped in
process A with CONTPTE, and only page2 is mapped in process B.
Then we will have:

entire_map = 0
page0.map = -1
page1.map = -1
page2.map = 0
page3.map = -1
....

>
> >
> > as long as it is a CONTPTE large folio, there is no much difference with
> > PMD-mapped large folio. it has all the chance to be DoubleMap and need
> > split.
> >
> > When A and B share a CONTPTE large folio, we do madvise(DONTNEED) or any
> > similar things on a part of the large folio in process A,
> >
> > this large folio will have partially mapped subpage in A (all CONTPE bits
> > in all subpages need to be removed though we only unmap a part of the
> > large folio as HW requires consistent CONTPTEs); and it has entire map in
> > process B(all PTEs are still CONPTES in process B).
> >
> > isn't it more sensible for this large folios to have entire_map = 0(for
> > process B), and subpages which are still mapped in process A has map_count
> > =0? (start from -1).
> >
> >> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >> check once if the folio maybe pinned, and in that case, you can simply
> >> drop all references again. So you either have all or no ptes to process,
> >> which makes that code easier.
>
> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> fundamentally you can only use entire_mapcount if its only possible to map and
> unmap the whole folio atomically.

My point is that the CONT bit should either be set in all 16 PTEs or be
dropped from all 16 PTEs. If all PTEs have CONT, the folio is entirely
mapped; otherwise, it is partially mapped. If a large folio is mapped in
one process with all CONTPTEs and meanwhile in another process with a
partial mapping (w/o CONTPTE), it is DoubleMapped.

Since we always hold the ptl to set or drop CONTPTE bits, set/drop is still
atomic within the spinlock-protected region.

>
> >>
> >> But that can be added on top, and I'll happily do that.
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
>

Thanks
Barry
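
A toy userspace sketch of the accounting proposed above may make the numbers easier to follow. It is not kernel code: every toy_* name is hypothetical, the 16-page size is taken from the "16 PTEs" mentioned in this mail, and summing the two counters per page is an assumption made by analogy with PMD-mapped folios. It covers only the map side and does not model the partial-unmap path that Ryan's objection concerns.

```c
#include <assert.h>

#define TOY_NR_PAGES 16 /* PTEs per CONTPTE block, as in the mail above */

/* Toy folio (hypothetical): every counter starts at -1, i.e. "not mapped". */
struct toy_folio {
	int entire_map;              /* whole-folio (CONTPTE) mappings */
	int page_map[TOY_NR_PAGES];  /* per-page (plain PTE) mappings */
};

void toy_folio_init(struct toy_folio *f)
{
	f->entire_map = -1;
	for (int i = 0; i < TOY_NR_PAGES; i++)
		f->page_map[i] = -1;
}

/* A process maps all 16 PTEs with the contiguous bit set. */
void toy_map_entire(struct toy_folio *f)
{
	f->entire_map++;
}

/* A process maps a single page of the folio with a plain PTE. */
void toy_map_page(struct toy_folio *f, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);
	f->page_map[idx]++;
}

/* Number of mappings covering page idx, entire plus per-page (assumed sum). */
int toy_page_mapcount(const struct toy_folio *f, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);
	return (f->entire_map + 1) + (f->page_map[idx] + 1);
}

/* The example from the mail: A maps the whole folio, B maps only page2. */
struct toy_folio toy_example(void)
{
	struct toy_folio f;

	toy_folio_init(&f);
	toy_map_entire(&f);   /* process A, CONTPTE mapping */
	toy_map_page(&f, 2);  /* process B, single PTE */
	return f;
}
```

Calling toy_example() leaves entire_map at 0, page_map[2] at 0 and every other per-page counter at -1, matching the listing in the mail; under the assumed sum, page2 is then covered by two mappings and each other page by one.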