Date: Fri, 3 Feb 2023 14:40:55 +0100
From: Michal Hocko
To: Mike Kravetz
Cc: Andrew Morton, David Hildenbrand, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Naoya Horiguchi, James Houghton,
	Peter Xu, Yang Shi, Vishal Moola, Matthew Wilcox, Muchun Song,
	stable@vger.kernel.org
Subject: Re: [PATCH 1/2] mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
Message-ID:
References: <20230126222721.222195-1-mike.kravetz@oracle.com>
	<20230126222721.222195-2-mike.kravetz@oracle.com>
	<4ad5163f-5368-0bd8-de9b-1400a7a653ed@redhat.com>
	<20230127150411.7c3b7b99fa4884a6af0b9351@linux-foundation.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:

On Wed 01-02-23 13:05:35, Mike Kravetz wrote:
> On 02/01/23 08:47, Michal Hocko wrote:
> > On Mon 30-01-23 14:08:47, Mike Kravetz wrote:
> > > On 01/30/23 13:36, Michal Hocko wrote:
> > > > On Fri 27-01-23 17:12:05, Mike Kravetz wrote:
> > > > > On 01/27/23 15:04, Andrew Morton wrote:
> > > > > > On Fri, 27 Jan 2023 17:23:39 +0100 David Hildenbrand wrote:
> > > > > > > On 26.01.23 23:27, Mike Kravetz wrote:
> > > >
> > > > Yes, this looks simple enough. My only concern would be that this
> > > > special casing might be required in other places which is hard to
> > > > evaluate. I thought PSS reported by smaps would be broken as well but it
> > > > seems PSS is not really accounted for hugetlb mappings at all.
> > > >
> > > > Have you tried to look into {in,de}creasing the map count of the page when
> > > > the pte is {un}shared for it?
> > >
> > > A quick thought is that it would not be too difficult. It would need
> > > to include the following:
> > > - At PMD share time in huge_pmd_share(),
> > >   Go through all entries in the PMD, and increment map and ref count for
> > >   all referenced pages. huge_pmd_share is just adding another sharing
> > >   process.
> > > - At PMD unshare time in huge_pmd_unshare(),
> > >   Go through all entries in the PMD, and decrement map and ref count for
> > >   all referenced pages. huge_pmd_unshare is just removing one sharing
> > >   process.
> > > - At page fault time, check if we are adding a new entry to a shared PMD.
> > >   If yes, add 'num_of_sharing__processes - 1' to the ref and map count.
> > >
> > > In each of the above operations, we are holding the PTL lock (which is
> > > really the split/PMD lock) so synchronization should not be an issue.
> > >
> > > Although I mention processes sharing the PMD above, it is really
> > > mappings/vmas sharing the PMD. You could have two mappings of the same
> > > object in the same process sharing PMDs.
> > >
> > > I'll code this up and see how it looks.
> >
> > Thanks!
> >
> > > However, unless you have an objection I would prefer the simple patches
> > > move forward, especially for stable backports.
> >
> > Yes, the current patch is much simpler and more suitable for stable
> > backports. If the explicit map count modifications are not all that
> > terrible then this would sound like a more appropriate long term plan
> > though.
>
> The approach mentioned above seems to be simple enough. Patch is below.
>
> I 'tested' with the same method and tests used to measure fault scalability
> when developing vma based locking [1]. I figured this would be a good stress
> of the share, unshare and fault paths. With the patch, we are doing more
> with the page table lock held, so I expected to see a little difference
> in scalability, but not as much as actually measured:
>
> next-20230131
> test              instances   unmodified   patched
> --------------------------------------------------------------------------
> Combined faults   24          61888.4      58314.8
> Combined forks    24            157.3        130.1

So faults are 6% slower while forks are hit by 18%. This is quite a lot
and more than I expected. PMD sharing shouldn't really be a common
operation AFAICS. It should only happen with the first mapping in the
PUD (so once every 1GB/2MB = 512 faults for fully populated mappings).
It would be good to know whether this is purely lock contention based
or the additional work in each #pf and unmapping makes a big impact as
well.
> These tests could seem a bit like a micro-benchmark targeting these code
> paths. However, I put them together based on the description of a
> customer workload that prompted the vma based locking work. And, performance
> of these tests seems to reflect performance of their workloads.
>
> This extra overhead is the cost needed to make shared PMD map counts be
> accurate and in line with what is normal and expected. I think it is
> worth the cost. Other opinions? Of course, the patch below may have
> issues so please take a look.

If an 18% slowdown really reflects a real workload then this might just
be too expensive, I am afraid.

> [1] https://lore.kernel.org/linux-mm/20220914221810.95771-1-mike.kravetz@oracle.com/
>
> From bff5a717521f96b0e5075ac4b5a1ef84a3589b7e Mon Sep 17 00:00:00 2001
> From: Mike Kravetz
> Date: Mon, 30 Jan 2023 20:14:14 -0800
> Subject: [PATCH] hugetlb: Adjust hugetlb page ref/map counts for PMD sharing
>
> When hugetlb PMDs are shared, the sharing code simply adds the shared
> PMD to another process's page table. It will not update the ref/map
> counts of pages referenced by the shared PMD. As a result, the ref/map
> count will only reflect when the page was added to the shared PMD. Even
> though the shared PMD may be in MANY process page tables, ref/map counts
> on the pages will only appear to be that of a single process.
>
> Update ref/map counts to take PMD sharing into account. This is done in
> three distinct places:
> 1) At PMD share time in huge_pmd_share(),
>    Go through all entries in the PMD, and increment map and ref count for
>    all referenced pages. huge_pmd_share is just adding another use and
>    mapping of each page.
> 2) At PMD unshare time in huge_pmd_unshare(),
>    Go through all entries in the PMD, and decrement map and ref count for
>    all referenced pages. huge_pmd_unshare is just removing one use and
>    mapping of each page.
> 3) When faulting in a new hugetlb page,
>    Check if we are adding a new entry to a shared PMD.
>    If yes, add 'num_of_sharing__processes - 1' to the ref and map count.

Honestly, I didn't really have much time to think about this very deeply
so I might be missing something here. The patch seems correct to me.
adjust_shared_pmd_page_counts's delta parameter is confusing because it
implies delta adjustments while it really just wants to be "bool increase"
instead. Thanks for looking into this Mike!

[...]

> +static void adjust_shared_pmd_page_counts(pmd_t *pmd_start, int delta)
> +{
> +	struct folio *folio;
> +	struct page *page;
> +	pte_t *ptep, pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		ptep = (pte_t *)(pmd_start + i);
> +
> +		pte = huge_ptep_get(ptep);
> +		if (huge_pte_none(pte) || !pte_present(pte))
> +			continue;
> +
> +		page = pte_page(pte);
> +		folio = (struct folio *)page;
> +		if (delta > 0) {
> +			folio_get(folio);
> +			atomic_inc(&folio->_entire_mapcount);
> +		} else {
> +			folio_put(folio);
> +			atomic_dec(&folio->_entire_mapcount);
> +		}
> +	}
> +}

[...]
-- 
Michal Hocko
SUSE Labs