From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 27 Nov 2023 10:10:58 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
Content-Language: en-GB
From: Ryan Roberts <ryan.roberts@arm.com>
To: Barry Song <21cnbao@gmail.com>
Cc: david@redhat.com, akpm@linux-foundation.org, andreyknvl@gmail.com,
 anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com,
 dvyukov@google.com, glider@google.com, james.morse@arm.com,
 jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com,
 maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com,
 suzuki.poulose@arm.com, vincenzo.frascino@arm.com,
 wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org,
 yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <271f1e98-6217-4b40-bae0-0ac9fe5851cb@redhat.com>
 <20231127084217.13110-1-v-songbaohua@oppo.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On 27/11/2023 09:59, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 08:42, Barry Song wrote:
>>>>> +		for (i = 0; i < nr; i++, page++) {
>>>>> +			if (anon) {
>>>>> +				/*
>>>>> +				 * If this page may have been pinned by the
>>>>> +				 * parent process, copy the page immediately for
>>>>> +				 * the child so that we'll always guarantee the
>>>>> +				 * pinned page won't be randomly replaced in the
>>>>> +				 * future.
>>>>> +				 */
>>>>> +				if (unlikely(page_try_dup_anon_rmap(
>>>>> +						page, false, src_vma))) {
>>>>> +					if (i != 0)
>>>>> +						break;
>>>>> +					/* Page may be pinned, we have to copy. */
>>>>> +					return copy_present_page(
>>>>> +						dst_vma, src_vma, dst_pte,
>>>>> +						src_pte, addr, rss, prealloc,
>>>>> +						page);
>>>>> +				}
>>>>> +				rss[MM_ANONPAGES]++;
>>>>> +				VM_BUG_ON(PageAnonExclusive(page));
>>>>> +			} else {
>>>>> +				page_dup_file_rmap(page, false);
>>>>> +				rss[mm_counter_file(page)]++;
>>>>> +			}
>>>>> 		}
>>>>> -		rss[MM_ANONPAGES]++;
>>>>> -	} else if (page) {
>>>>> -		folio_get(folio);
>>>>> -		page_dup_file_rmap(page, false);
>>>>> -		rss[mm_counter_file(page)]++;
>>>>> +
>>>>> +	nr = i;
>>>>> +	folio_ref_add(folio, nr);
>>>>
>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>> Make sure your refcount >= mapcount.
>>>>
>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>> pages are the corner case.
>>>>
>>>> I'll note that it will make a lot of sense to have batch variants of
>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
>>>>
>>>
>>> I still don't understand why it is not a single entire_mapcount +1, but
>>> an increment for each base page.
>>
>> Because we are PTE-mapping the folio, we have to account each individual
>> page. If we accounted the entire folio, where would we unaccount it? Each
>> page can be unmapped individually (e.g. munmap() of part of the folio), so
>> we need to account each page. When PMD-mapping, the whole thing is either
>> mapped or unmapped, and it's atomic, so we can account the entire thing.
>
> Hi Ryan,
>
> There is no problem. For example, a large folio is entirely mapped in
> process A with CONTPTE, and only page2 is mapped in process B.
> Then we will have:
>
> entire_map = 0
> page0.map = -1
> page1.map = -1
> page2.map = 0
> page3.map = -1
> ....
>
>>
>>>
>>> As long as it is a CONTPTE large folio, there is not much difference from
>>> a PMD-mapped large folio. It has every chance to be double-mapped and to
>>> need splitting.
>>>
>>> When A and B share a CONTPTE large folio, and we do madvise(MADV_DONTNEED)
>>> or anything similar on a part of the large folio in process A, this large
>>> folio will have partially mapped subpages in A (the CONTPTE bits in all
>>> subpages need to be removed even though we only unmap a part of the large
>>> folio, as the HW requires consistent CONTPTEs); and it is entirely mapped
>>> in process B (all PTEs are still CONTPTEs in process B).
>>>
>>> Isn't it more sensible for this large folio to have entire_map = 0 (for
>>> process B), while the subpages which are still mapped in process A have
>>> map_count = 0 (starting from -1)?
>>>
>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>> check once if the folio maybe pinned, and in that case, you can simply
>>>> drop all references again. So you either have all or no ptes to process,
>>>> which makes that code easier.
>>
>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>> fundamentally you can only use entire_mapcount if it is only possible to
>> map and unmap the whole folio atomically.
>
> My point is that the CONTPTE bit should either be set in all 16 PTEs or
> dropped from all 16 PTEs. If all PTEs have CONT, it is entirely mapped;
> otherwise, it is partially mapped. If a large folio is mapped in one
> process with all CONTPTEs and meanwhile in another process with a partial
> mapping (w/o CONTPTE), it is double-mapped.

There are 2 problems with your proposal, as I see it:

1) The core-mm is not enlightened for CONTPTE mappings. As far as it is
concerned, it's just mapping a bunch of PTEs, so it has no hook with which
to inc/dec entire_mapcount.
The arch code is opportunistically and *transparently* managing the
CONT_PTE bit.

2) There is nothing to say a folio isn't *bigger* than the contpte block;
it may be 128K and be mapped with 2 contpte blocks. Or it may even be a
PTE-mapped THP (2M) mapped with 32 contpte blocks. So you can't say it is
entirely mapped unless/until ALL of those blocks are set up. And then, of
course, each block could be unmapped non-atomically.

For the PMD case there are actually 2 properties that allow using the
entire_mapcount optimization: it's atomically mapped/unmapped through the
PMD, and we know that the folio is exactly PMD-sized (since it must be at
least PMD-sized to be mappable with the PMD, and we don't allocate THPs any
bigger than PMD size). So one PMD map or unmap operation corresponds to
exactly one *entire* map or unmap. That is not true when we are
PTE-mapping.

> Since we always hold the ptl to set or drop CONTPTE bits, set/drop is
> still atomic within a spinlock area.
>
>>
>>>>
>>>> But that can be added on top, and I'll happily do that.
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>
>
> Thanks
> Barry
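For what it's worth, David's suggested ordering above (do the
folio_ref_add(folio, nr) first, then hand back the unconsumed references
if page_try_dup_anon_rmap() fails) can be sketched outside the kernel.
The toy below is only a model, not kernel code: struct toy_folio,
toy_try_dup_anon_rmap() and toy_batch_dup() are invented stand-ins, and a
maybe_pinned flag simulates the pin check. The point is just that the
refcount >= mapcount invariant holds at every step of the batch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct folio: only the two counters that matter here. */
struct toy_folio {
	int refcount;
	int mapcount;		/* toy-level sum of per-page mapcounts */
	bool maybe_pinned;	/* simulates the "folio may be pinned" check */
};

/* Toy stand-in for page_try_dup_anon_rmap(): fails if maybe pinned. */
static bool toy_try_dup_anon_rmap(struct toy_folio *folio)
{
	if (folio->maybe_pinned)
		return false;
	folio->mapcount++;
	return true;
}

/*
 * Duplicate up to nr PTE mappings of one folio, taking all the folio
 * references up front so refcount >= mapcount is never violated, then
 * returning the unused references if the batch stops early.
 * Returns the number of pages actually duplicated.
 */
static int toy_batch_dup(struct toy_folio *folio, int nr)
{
	int i;

	folio->refcount += nr;			/* refs first ... */
	for (i = 0; i < nr; i++) {
		if (!toy_try_dup_anon_rmap(folio))
			break;			/* maybe pinned: stop batch */
		assert(folio->refcount >= folio->mapcount);	/* invariant */
	}
	folio->refcount -= nr - i;		/* ... hand back the unused */
	return i;
}
```

With this ordering the error path only ever *decrements* the refcount back
toward its starting value; at no point does the mapcount run ahead of it,
which is the property David asked to preserve.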