From: Ryan Roberts <ryan.roberts@arm.com>
Date: Mon, 27 Nov 2023 11:07:46 +0000
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
To: Barry Song <21cnbao@gmail.com>
Cc: david@redhat.com, akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com, dvyukov@google.com, glider@google.com, james.morse@arm.com, jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com, vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
References: <271f1e98-6217-4b40-bae0-0ac9fe5851cb@redhat.com> <20231127084217.13110-1-v-songbaohua@oppo.com>
On 27/11/2023 10:28, Barry Song wrote:
> On Mon, Nov 27, 2023 at 11:11 PM Ryan Roberts wrote:
>>
>> On 27/11/2023 09:59, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:35 PM Ryan Roberts wrote:
>>>>
>>>> On 27/11/2023 08:42, Barry Song wrote:
>>>>>>> +		for (i = 0; i < nr; i++, page++) {
>>>>>>> +			if (anon) {
>>>>>>> +				/*
>>>>>>> +				 * If this page may have been pinned by the
>>>>>>> +				 * parent process, copy the page immediately for
>>>>>>> +				 * the child so that we'll always guarantee the
>>>>>>> +				 * pinned page won't be randomly replaced in the
>>>>>>> +				 * future.
>>>>>>> +				 */
>>>>>>> +				if (unlikely(page_try_dup_anon_rmap(
>>>>>>> +						page, false, src_vma))) {
>>>>>>> +					if (i != 0)
>>>>>>> +						break;
>>>>>>> +					/* Page may be pinned, we have to copy. */
>>>>>>> +					return copy_present_page(
>>>>>>> +						dst_vma, src_vma, dst_pte,
>>>>>>> +						src_pte, addr, rss, prealloc,
>>>>>>> +						page);
>>>>>>> +				}
>>>>>>> +				rss[MM_ANONPAGES]++;
>>>>>>> +				VM_BUG_ON(PageAnonExclusive(page));
>>>>>>> +			} else {
>>>>>>> +				page_dup_file_rmap(page, false);
>>>>>>> +				rss[mm_counter_file(page)]++;
>>>>>>> +			}
>>>>>>> 		}
>>>>>>> -		rss[MM_ANONPAGES]++;
>>>>>>> -	} else if (page) {
>>>>>>> -		folio_get(folio);
>>>>>>> -		page_dup_file_rmap(page, false);
>>>>>>> -		rss[mm_counter_file(page)]++;
>>>>>>> +
>>>>>>> +		nr = i;
>>>>>>> +		folio_ref_add(folio, nr);
>>>>>>
>>>>>> You're changing the order of mapcount vs. refcount increment. Don't.
>>>>>> Make sure your refcount >= mapcount.
>>>>>>
>>>>>> You can do that easily by doing the folio_ref_add(folio, nr) first and
>>>>>> then decrementing in case of error accordingly. Errors due to pinned
>>>>>> pages are the corner case.
>>>>>>
>>>>>> I'll note that it will make a lot of sense to have batch variants of
>>>>>> page_try_dup_anon_rmap() and page_dup_file_rmap().
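(Purely as an illustration of the ordering being described here, keeping the
shape of the hunk quoted above - this is a sketch, not the actual patch: take
all the references up front, then give back the unused ones if
page_try_dup_anon_rmap() fails:)

	folio_ref_add(folio, nr);	/* refcount first: refcount >= mapcount holds throughout */
	for (i = 0; i < nr; i++, page++) {
		if (anon) {
			if (unlikely(page_try_dup_anon_rmap(page, false,
							    src_vma))) {
				/* Give back the references we did not use. */
				folio_ref_sub(folio, nr - i);
				if (i != 0)
					break;
				/* Page may be pinned, we have to copy. */
				return copy_present_page(
					dst_vma, src_vma, dst_pte,
					src_pte, addr, rss, prealloc, page);
			}
			rss[MM_ANONPAGES]++;
			VM_BUG_ON(PageAnonExclusive(page));
		} else {
			page_dup_file_rmap(page, false);
			rss[mm_counter_file(page)]++;
		}
	}
	nr = i;		/* number of PTEs actually duplicated */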
>>>>>
>>>>> I still don't understand why it is not an entire_map +1, but an
>>>>> increment for each base page.
>>>>
>>>> Because we are PTE-mapping the folio, we have to account each individual
>>>> page. If we accounted the entire folio, where would we unaccount it? Each
>>>> page can be unmapped individually (e.g. munmap() part of the folio), so we
>>>> need to account each page. When PMD-mapping, the whole thing is either
>>>> mapped or unmapped, and it's atomic, so we can account the entire thing.
>>>
>>> Hi Ryan,
>>>
>>> That is not a problem. For example, a large folio is entirely mapped in
>>> process A with CONTPTE, and only page2 is mapped in process B.
>>> Then we will have:
>>>
>>> entire_map = 0
>>> page0.map = -1
>>> page1.map = -1
>>> page2.map = 0
>>> page3.map = -1
>>> ...
>>>
>>>>
>>>>>
>>>>> As long as it is a CONTPTE large folio, there is not much difference from
>>>>> a PMD-mapped large folio. It has every chance of becoming double-mapped
>>>>> and needing a split.
>>>>>
>>>>> When A and B share a CONTPTE large folio and we do madvise(MADV_DONTNEED)
>>>>> or anything similar on a part of the large folio in process A,
>>>>>
>>>>> this large folio will have partially mapped subpages in A (the CONT bits
>>>>> in all subpages need to be removed even though we only unmap a part of
>>>>> the large folio, as the HW requires consistent CONTPTEs); and it is
>>>>> entirely mapped in process B (all PTEs are still CONTPTEs in process B).
>>>>>
>>>>> Isn't it more sensible for such large folios to have entire_map = 0 (for
>>>>> process B), while subpages which are still mapped in process A have
>>>>> map_count = 0 (starting from -1)?
>>>>>
>>>>>> Especially, the batch variant of page_try_dup_anon_rmap() would only
>>>>>> check once if the folio may be pinned, and in that case, you can simply
>>>>>> drop all references again. So you either have all or no ptes to process,
>>>>>> which makes that code easier.
>>>>
>>>> I'm afraid this doesn't make sense to me. Perhaps I've misunderstood. But
>>>> fundamentally you can only use entire_mapcount if it's only possible to
>>>> map and unmap the whole folio atomically.
>>>
>>> My point is that the CONT bits should either be set in all 16 PTEs or
>>> dropped from all 16 PTEs. If all PTEs have CONT set, the folio is entirely
>>> mapped; otherwise, it is partially mapped. If a large folio is mapped in
>>> one process with all CONTPTEs and meanwhile in another process with a
>>> partial mapping (w/o CONTPTE), it is double-mapped.
>>
>> There are two problems with your proposal, as I see it:
>>
>> 1) The core-mm is not enlightened for CONTPTE mappings. As far as it is
>> concerned, it's just mapping a bunch of PTEs, so it has no hook to inc/dec
>> entire_mapcount. The arch code is opportunistically and *transparently*
>> managing the CONT_PTE bit.
>>
>> 2) There is nothing to say a folio isn't *bigger* than the contpte block;
>> it may be 128K and be mapped with 2 contpte blocks. Or even a PTE-mapped
>> THP (2M) mapped with 32 contpte blocks. So you can't say it is entirely
>> mapped unless/until ALL of those blocks are set up. And then, of course,
>> each block could be unmapped non-atomically.
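(To make the geometry concrete - this is an invented helper just for
illustration, assuming arm64's 4K granule where a contpte block covers 16
PTEs, i.e. 64K:)

	/*
	 * Illustration only (hypothetical helper, not in the patch):
	 * how many contpte blocks a folio spans, assuming 4K base pages
	 * and 16-PTE (64K) contpte blocks. A 128K folio spans 2 blocks;
	 * a 2M PTE-mapped THP spans 32.
	 */
	#define EXAMPLE_CONT_PTES	16
	static inline unsigned long folio_contpte_blocks(struct folio *folio)
	{
		return folio_nr_pages(folio) / EXAMPLE_CONT_PTES;
	}

Each of those blocks can gain or lose the CONT bit independently of the
others, which is why the folio as a whole cannot be treated as one atomic
"entire" mapping.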
>>
>> For the PMD case there are actually two properties that allow using the
>> entire_mapcount optimization: it's atomically mapped/unmapped through the
>> PMD, and we know that the folio is exactly PMD-sized (since it must be at
>> least PMD-sized to be mappable with the PMD, and we don't allocate THPs any
>> bigger than PMD size). So one PMD map or unmap operation corresponds to
>> exactly one *entire* map or unmap. That is not true when we are PTE-mapping.
>
> Well, thanks for the clarification. Based on the above description, I agree
> the current code might make more sense by always using the mapcount of the
> subpage.
>
> I gave my proposal because I thought we would always be at CONTPTE size for
> small-THP, and then we could drop the loop that iterates the rmap 16 times.
> If we did it for the entire folio, we would only need to dup the rmap once
> for all 16 PTEs by increasing entire_map.

Well, it's always good to have the discussion - so thanks for the ideas.

I think there is a bigger question lurking here: should we be exposing the
concept of contpte mappings to the core-mm rather than burying it in the
arm64 arch code? I'm confident that would be a huge amount of effort, and the
end result would give similar performance to this approach. One potential
benefit of letting core-mm control it is that it would also give core-mm
control over the granularity of access/dirty reporting (my approach
implicitly ties it to the folio). Having sub-folio access tracking _could_
potentially help with future work to make THP size selection automatic, but
we are not there yet, and I think there are other (simpler) ways to achieve
the same thing. So my view is that _not_ exposing it to core-mm is the right
way for now.

> BTW, I have concerns about whether a variable small-THP size will really
> work, as userspace is probably friendly to only one fixed size. For example,
> userspace heap management might be optimized around one size for freeing
> memory back to the kernel. It is very difficult for the heap to adapt to
> various sizes at the same time. Frequent unmaps/frees at a size not equal
> to, and particularly smaller than, the small-THP size will defeat all
> efforts to use small-THP.

I'll admit to not knowing a huge amount about user space allocators. But I
will say that as currently defined, the small-sized THP interface to user
space allows a sysadmin to specifically enable the set of sizes that they
want; so a single size can be enabled. I'm deliberately punting that decision
away from the kernel for now.

FWIW, my experience with the Speedometer/JavaScript use case is that
performance is a little bit better when enabling 64K+32K+16K vs just 64K THP.

Functionally, it will not matter if the allocator is not enlightened for the
THP size; it can continue to free, and if a partial folio is unmapped it is
put on the deferred split list, then under memory pressure it is split and
the unused pages are reclaimed. I guess this is the bit you are concerned
about having a performance impact?

Regardless, it would be good to move this conversation to the small-sized THP
patch series, since this is all independent of contpte mappings.

>
>>
>>>
>>> Since we always hold the ptl to set or drop CONTPTE bits, set/drop is
>>> still atomic within the spinlock region.
>>>
>>>>
>>>>>>
>>>>>> But that can be added on top, and I'll happily do that.
>>>>>>
>>>>>> --
>>>>>> Cheers,
>>>>>>
>>>>>> David / dhildenb
>>>>>
>>>
>
> Thanks
> Barry