From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 30 Jan 2023 12:29:48 -0500
From: Peter Xu <peterx@redhat.com>
To: James Houghton
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
	Axel Rasmussen, Mina Almasry, Zach O'Keefe, Manish Mishra,
	Naoya Horiguchi, "Dr. David Alan Gilbert",
	"Matthew Wilcox (Oracle)", Vlastimil Babka, Baolin Wang,
	Miaohe Lin, Yang Shi, Andrew Morton,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
References: <6548b3b3-30c9-8f64-7d28-8a434e0a0b80@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> On Thu, Jan 26, 2023 at 12:31 PM Peter Xu wrote:
> >
> > James,
> >
> > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > It turns out that the THP-like scheme significantly slows down
> > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > >
> > > This is only really a problem because this is done between
> > > mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > access any of the 1G page while we're doing this (and it can take like
> > > ~1 second for each 1G, at least on the x86 server I was testing on).
> >
> > Did you try to measure the time, or it's a quick observation from perf?
>
> I put some ktime_get()s in.
>
> > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > But maybe it depends on many things.
> >
> > I'm curious how the 1sec is provisioned between the procedures. E.g., I
> > would expect mmu_notifier_invalidate_range_start() to also take some time
> > too as it should walk the smally mapped EPT pgtables.
>
> Somehow this doesn't take all that long (only like 10-30ms when
> collapsing from 4K -> 1G) compared to hugetlb_collapse().

Did you populate the EPT pgtable as much when measuring this?

IIUC this number should be largely determined by how many pages are
shadowed into the kvm pgtables.  If the EPT table is mostly empty it
should be super fast, but OTOH it can be much slower when it's populated,
because the tdp mmu needs to handle the pgtable leaves one by one.

E.g. it should be fully populated if you have a program busy dirtying
most of the guest pages during test migration.  A write op should be the
worst case here, since it'll require the atomic op being applied; see
kvm_tdp_mmu_write_spte().

> >
> > Since we'll still keep the intermediate levels around - from application
> > POV, one other thing to remedy this is further shrink the size of
> > COLLAPSE so potentially for a very large page we can start with building
> > 2M layers.  But then collapse will need to be run at least two rounds.
>
> That's exactly what I thought to do. :) I realized, too, that this is
> actually how userspace *should* collapse things to avoid holding up
> vCPUs too long. I think this is a good reason to keep intermediate
> page sizes.
>
> When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> huge difference: the THP-like scheme is about 30% slower overall.
>
> When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> difference. For the THP-like scheme, collapsing 4K -> 2M requires
> decrementing and then re-incrementing subpage->_mapcount, and then
> from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> decrement the compound_mapcount 512 times (once per PTE), then
> increment it once. And then for 2M -> 1G, for each 1G, we decrement
> mapcount again by 512 (once per PMD), incrementing it once.

Did you have quantified numbers (with your ktime tweak) to compare these?
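The per-scheme bookkeeping quoted above can be tallied with some
back-of-the-envelope arithmetic (a sketch only, with illustrative
function names; it counts mapcount operations, not their relative
per-op cost):

```python
# Rough tally of the mapcount operations described above, using x86-64
# page-table geometry; illustrative only, not kernel code.
PTES_PER_PMD = 512                               # 4K PTEs per 2M region
PMDS_PER_PUD = 512                               # 2M PMDs per 1G region
SUBPAGES_PER_PUD = PTES_PER_PMD * PMDS_PER_PUD   # 262144 4K subpages per 1G

def thp_like_4k_to_1g():
    # THP-like scheme, direct 4K -> 1G: one _mapcount decrement per subpage.
    return SUBPAGES_PER_PUD

def thp_like_4k_to_2m_to_1g():
    # Two rounds: 4K -> 2M decrements and re-increments every subpage's
    # _mapcount, then 2M -> 1G decrements all of them again.
    return 2 * SUBPAGES_PER_PUD + SUBPAGES_PER_PUD

def head_only_4k_to_2m_to_1g():
    # Per 2M region: 512 decrements (one per PTE) + 1 increment, done for
    # each of the 512 regions; then 2M -> 1G: 512 decrements + 1 increment.
    return PMDS_PER_PUD * (PTES_PER_PMD + 1) + (PMDS_PER_PUD + 1)

print(thp_like_4k_to_1g())         # 262144  (the "262k" above)
print(thp_like_4k_to_2m_to_1g())   # 786432
print(head_only_4k_to_2m_to_1g())  # 263169
```

Raw counts alone don't settle the question, which is why the timing
numbers matter: the cost per operation (atomic vs. not, one head page
staying cache-hot vs. touching 262k tail pages) also differs between the
two schemes.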
If we want to go the other route, I think these will be the materials to
justify any other approach on mapcount handling.

> The mapcount decrements are about on par with how long it takes to do
> other things, like updating page tables. The main problem is, with the
> THP-like scheme (implemented like this [1]), there isn't a way to
> avoid the 262k decrements when collapsing 1G. So if we want
> MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> then I think something more clever needs to be implemented.
>
> [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178

I believe the whole goal of HGM is to face the same challenge we'd hit if
we allowed 1G THPs to exist and to be split for anon.  I don't remember
whether we discussed the below; maybe we did?  Anyway...

Another way to avoid the thp mapcount, without breaking smaps and similar
calls to page_mapcount() on the small page, is to increase the hpage
mapcount only when the hstate pXd entry (in the case of 1G, the PUD entry)
becomes populated (no matter whether as a leaf or a non-leaf), and to
decrease the mapcount when the pXd entry is removed (for a leaf, that's
the same as now; for HGM, it's when freeing the pgtable under the PUD
entry).

Again, in all cases I think some solid measurements would definitely be
helpful (as commented above) to see how much overhead there will be, and
whether that'll start to become a problem, at least for the current
motivations of the whole HGM idea.

Thanks,

-- 
Peter Xu
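The pXd-based counting rule proposed above can be mocked up as a toy
userspace model (all names here, like `HugePage` and `PudSlot`, are
hypothetical illustrations, not kernel structures): the hpage mapcount
moves only when the PUD-level entry goes empty -> populated or back, so
an HGM split or collapse underneath a still-populated entry costs no
mapcount traffic at all.

```python
# Toy model of the proposed rule: mapcount follows hstate pXd (PUD) entry
# population, not the granularity of the mappings underneath it.
class HugePage:
    def __init__(self):
        self.mapcount = 0

class PudSlot:
    """One hstate-level PUD entry in some VMA; 'state' is None (empty),
    'leaf' (1G leaf mapping), or 'table' (HGM: points to a lower pgtable)."""
    def __init__(self, page):
        self.page = page
        self.state = None

    def populate(self, state):
        # Count only on the empty -> populated edge, leaf or non-leaf alike.
        if self.state is None:
            self.page.mapcount += 1
        self.state = state

    def clear(self):
        # Decrement when the PUD entry is removed (leaf unmap, or freeing
        # the pgtable that a non-leaf entry points to).
        if self.state is not None:
            self.page.mapcount -= 1
        self.state = None

page = HugePage()
slot = PudSlot(page)
slot.populate("leaf")    # map the 1G leaf: mapcount 0 -> 1
slot.populate("table")   # HGM split to smaller mappings: stays at 1
print(page.mapcount)     # 1 -- no per-subpage traffic on split/collapse
slot.clear()             # tear the entry down: 1 -> 0
print(page.mapcount)     # 0
```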