From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Mon, 27 Nov 2023 23:28:49 +1300
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
To: Ryan Roberts
Cc: david@redhat.com, akpm@linux-foundation.org, andreyknvl@gmail.com,
 anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com,
 dvyukov@google.com, glider@google.com, james.morse@arm.com,
 jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com,
 maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com,
 suzuki.poulose@arm.com, vincenzo.frascino@arm.com,
 wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org,
 yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <271f1e98-6217-4b40-bae0-0ac9fe5851cb@redhat.com> <20231127084217.13110-1-v-songbaohua@oppo.com>
Content-Type: text/plain; charset="UTF-8"

On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts wrote:
>
> On 27/11/2023 09:59, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts wrote:
> >>
> >> On 27/11/2023 08:42, Barry Song wrote:
> >>>>> +			for (i = 0; i < nr; i++, page++) {
> >>>>> +				if (anon) {
> >>>>> +					/*
> >>>>> +					 * If this page may have been pinned by the
> >>>>> +					 * parent process, copy the page immediately for
> >>>>> +					 * the child so that we'll always guarantee the
> >>>>> +					 * pinned page won't be randomly replaced in the
> >>>>> +					 * future.
> >>>>> +					 */
> >>>>> +					if (unlikely(page_try_dup_anon_rmap(
> >>>>> +							page, false, src_vma))) {
> >>>>> +						if (i != 0)
> >>>>> +							break;
> >>>>> +						/* Page may be pinned, we have to copy. */
> >>>>> +						return copy_present_page(
> >>>>> +							dst_vma, src_vma, dst_pte,
> >>>>> +							src_pte, addr, rss, prealloc,
> >>>>> +							page);
> >>>>> +					}
> >>>>> +					rss[MM_ANONPAGES]++;
> >>>>> +					VM_BUG_ON(PageAnonExclusive(page));
> >>>>> +				} else {
> >>>>> +					page_dup_file_rmap(page, false);
> >>>>> +					rss[mm_counter_file(page)]++;
> >>>>> +				}
> >>>>> +			}
> >>>>> -		rss[MM_ANONPAGES]++;
> >>>>> -	} else if (page) {
> >>>>> -		folio_get(folio);
> >>>>> -		page_dup_file_rmap(page, false);
> >>>>> -		rss[mm_counter_file(page)]++;
> >>>>> +
> >>>>> +		nr = i;
> >>>>> +		folio_ref_add(folio, nr);
> >>>>
> >>>> You're changing the order of mapcount vs. refcount increment. Don't.
> >>>> Make sure your refcount >= mapcount.
> >>>>
> >>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
> >>>> then decrementing in case of error accordingly. Errors due to pinned
> >>>> pages are the corner case.
> >>>>
> >>>> I'll note that it will make a lot of sense to have batch variants of
> >>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
> >>>>
> >>>
> >>> I still don't understand why it is not an entire map +1, but an increment
> >>> in each basepage.
> >>
> >> Because we are PTE-mapping the folio, we have to account each individual page.
> >> If we accounted the entire folio, where would we unaccount it? Each page can be
> >> unmapped individually (e.g. munmap() part of the folio) so we need to account
> >> each page. When PMD mapping, the whole thing is either mapped or unmapped, and
> >> it's atomic, so we can account the entire thing.
> >
> > Hi Ryan,
> >
> > There is no problem. For example, a large folio is entirely mapped in
> > process A with CONTPTE, and only page2 is mapped in process B.
> > Then we will have
> >
> > entire_map = 0
> > page0.map = -1
> > page1.map = -1
> > page2.map = 0
> > page3.map = -1
> > ....
> >
> >>
> >>>
> >>> As long as it is a CONTPTE large folio, there is not much difference from
> >>> a PMD-mapped large folio. It has every chance to be DoubleMapped and to
> >>> need splitting.
> >>>
> >>> When A and B share a CONTPTE large folio and we do madvise(DONTNEED) or
> >>> any similar thing on a part of the large folio in process A,
> >>> this large folio will have partially mapped subpages in A (all CONTPTE
> >>> bits in all subpages need to be removed though we only unmap a part of
> >>> the large folio, as HW requires consistent CONTPTEs); and it has an
> >>> entire map in process B (all PTEs are still CONTPTEs in process B).
> >>>
> >>> Isn't it more sensible for this large folio to have entire_map = 0 (for
> >>> process B), while subpages which are still mapped in process A have
> >>> map_count = 0? (starting from -1)
> >>>
> >>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
> >>>> check once if the folio is maybe pinned, and in that case, you can simply
> >>>> drop all references again. So you either have all or no ptes to process,
> >>>> which makes that code easier.
> >>
> >> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
> >> fundamentally you can only use entire_mapcount if it's only possible to map
> >> and unmap the whole folio atomically.
> >
> > My point is that CONTPTEs should either be all set in all 16 PTEs or all
> > dropped in 16 PTEs. If all PTEs have CONT, it is entirely mapped; otherwise,
> > it is partially mapped. If a large folio is mapped in one process with all
> > CONTPTEs and meanwhile in another process with a partial mapping (w/o
> > CONTPTE), it is DoubleMapped.
>
> There are 2 problems with your proposal, as I see it;
>
> 1) The core-mm is not enlightened for CONTPTE mappings. As far as it is
> concerned, it's just mapping a bunch of PTEs. So it has no hook to inc/dec
> entire_mapcount. The arch code is opportunistically and *transparently*
> managing the CONT_PTE bit.
>
> 2) There is nothing to say a folio isn't *bigger* than the contpte block; it
> may be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped THP (2M)
> and be mapped with 32 contpte blocks. So you can't say it is entirely mapped
> unless/until ALL of those blocks are set up. And then of course each block
> could be unmapped unatomically.
>
> For the PMD case there are actually 2 properties that allow using the
> entire_mapcount optimization; it's atomically mapped/unmapped through the PMD
> and we know that the folio is exactly PMD sized (since it must be at least PMD
> sized to be able to map it with the PMD, and we don't allocate THPs any bigger
> than PMD size). So one PMD map or unmap operation corresponds to exactly one
> *entire* map or unmap. That is not true when we are PTE mapping.

Well, thanks for the clarification. Based on the above description, I agree the
current code might make more sense by always using the mapcount in each subpage.

I gave my proposal because I thought we would always use the CONTPTE size for
small-THP; then we could drop the loop that iterates the rmap 16 times. If we
did it entirely, we would only need to dup the rmap once for all 16 PTEs by
increasing entire_map.

BTW, I have concerns about whether a variable small-THP size will really work,
as userspace is probably friendly to only one fixed size. For example,
userspace heap management might be optimized for one size when freeing memory
to the kernel; it is very difficult for the heap to adapt to various sizes at
the same time. Frequent unmap/free of a size not equal to, and particularly
smaller than, the small-THP size will defeat all efforts to use small-THP.
> >
> > Since we always hold the ptl to set or drop CONTPTE bits, set/drop is
> > still atomic within the spinlocked region.
> >
> >>
> >>>>
> >>>> But that can be added on top, and I'll happily do that.
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb
> >>>

Thanks
Barry