From: James Houghton
Date: Fri, 27 Jan 2023 13:02:02 -0800
Subject: Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range
To: Peter Xu
Cc: Mike Kravetz, David Hildenbrand, Muchun Song, David Rientjes,
 Axel Rasmussen, Mina Almasry, "Zach O'Keefe", Manish Mishra,
 Naoya Horiguchi, "Dr . David Alan Gilbert", "Matthew Wilcox (Oracle)",
 Vlastimil Babka, Baolin Wang, Miaohe Lin, Yang Shi, Andrew Morton,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Thu, Jan 26, 2023 at 12:31 PM Peter Xu wrote:
>
> James,
>
> On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > It turns out that the THP-like scheme significantly slows down
> > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> >
> > This is only really a problem because this is done between
> > mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > access any of the 1G page while we're doing this (and it can take like
> > ~1 second for each 1G, at least on the x86 server I was testing on).
>
> Did you try to measure the time, or it's a quick observation from perf?

I put some ktime_get()s in.

>
> IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> But maybe it depends on many things.
>
> I'm curious how the 1sec is provisioned between the procedures. E.g., I
> would expect mmu_notifier_invalidate_range_start() to also take some time
> too as it should walk the smally mapped EPT pgtables.

Somehow this doesn't take all that long (only like 10-30ms when
collapsing from 4K -> 1G) compared to hugetlb_collapse().

>
> Since we'll still keep the intermediate levels around - from application
> POV, one other thing to remedy this is further shrink the size of COLLAPSE
> so potentially for a very large page we can start with building 2M layers.
> But then collapse will need to be run at least two rounds.

That's exactly what I thought to do. :) I realized, too, that this is
actually how userspace *should* collapse things to avoid holding up
vCPUs too long. I think this is a good reason to keep intermediate
page sizes.

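A minimal sketch of what a two-round collapse could look like from
userspace. It assumes that MADV_COLLAPSE on a hugetlb VMA collapses to
the largest page size that fits the range it is given, which is how the
two rounds are described above. The names collapse_fn,
collapse_madvise(), collapse_dry_run(), collapse_in_steps() and
collapse_two_rounds() are made up for illustration; the dry-run
callback exists only so the loop can be exercised on a kernel without
this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* value in the Linux uapi headers since 6.1 */
#endif

#define SZ_2M	(1UL << 21)
#define SZ_1G	(1UL << 30)

/* Hypothetical callback type: collapse one naturally aligned range. */
typedef int (*collapse_fn)(uintptr_t addr, size_t len);

/* Real collapse: a single madvise(MADV_COLLAPSE) call on the range. */
static int collapse_madvise(uintptr_t addr, size_t len)
{
	return madvise((void *)addr, len, MADV_COLLAPSE);
}

/* Dry run: does nothing, so the loop can be tested anywhere. */
static int collapse_dry_run(uintptr_t addr, size_t len)
{
	(void)addr;
	(void)len;
	return 0;
}

/*
 * Collapse [base, base + len) one step-sized chunk at a time.
 * Returns the number of chunks collapsed, or -1 on the first failure.
 */
static long collapse_in_steps(uintptr_t base, size_t len, size_t step,
			      collapse_fn fn)
{
	long chunks = 0;
	size_t off;

	assert(step != 0 && base % step == 0 && len % step == 0);
	for (off = 0; off < len; off += step) {
		if (fn(base + off, step) != 0)
			return -1;
		chunks++;
	}
	return chunks;
}

/* Round 1 builds the 2M layer, round 2 builds the 1G layer. */
static int collapse_two_rounds(uintptr_t base, size_t len, collapse_fn fn)
{
	if (collapse_in_steps(base, len, SZ_2M, fn) < 0)
		return -1;
	if (collapse_in_steps(base, len, SZ_1G, fn) < 0)
		return -1;
	return 0;
}
```

A caller would pass collapse_madvise as the callback for a real
mapping. The intent is that each call only holds up vCPUs for the
range it covers, so the 2M round is split into 512 short calls per 1G.
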
When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
huge difference: the THP-like scheme is about 30% slower overall.

When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
difference. For the THP-like scheme, collapsing 4K -> 2M requires
decrementing and then re-incrementing subpage->_mapcount, and then
from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
the head-only scheme, for each 2M in the 4K -> 2M collapse, we
decrement the compound_mapcount 512 times (once per PTE), then
increment it once. And then for 2M -> 1G, for each 1G, we decrement
mapcount again by 512 (once per PMD), incrementing it once.

The mapcount decrements are about on par with how long it takes to do
other things, like updating page tables. The main problem is, with the
THP-like scheme (implemented like this [1]), there isn't a way to
avoid the 262k decrements when collapsing 1G. So if we want
MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
then I think something more clever needs to be implemented.

[1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178

- James
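
The decrement counts above can be reproduced with a little arithmetic.
The sketch below is only a model of the description in this mail, not
kernel code: it counts mapcount decrements for a single collapse of one
`to`-sized region and ignores the increments and everything else the
collapse does. The function names are made up for illustration.

```c
#include <assert.h>

#define SZ_4K	(1UL << 12)
#define SZ_2M	(1UL << 21)
#define SZ_1G	(1UL << 30)

/*
 * THP-like scheme: every 4K subpage under the new mapping has its
 * _mapcount decremented, whatever size the old mappings were.
 */
static unsigned long thp_like_decrements(unsigned long from, unsigned long to)
{
	assert(from >= SZ_4K && to > from && to % from == 0);
	return to / SZ_4K;
}

/*
 * Head-only scheme: one compound_mapcount decrement per old page table
 * entry that goes away (once per PTE, or once per PMD).
 */
static unsigned long head_only_decrements(unsigned long from, unsigned long to)
{
	assert(from >= SZ_4K && to > from && to % from == 0);
	return to / from;
}
```

For the final 2M -> 1G step this gives 262144 decrements for the
THP-like scheme and 512 for the head-only scheme, while a direct
4K -> 1G collapse gives 262144 for both.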