From: Yang Shi
Date: Wed, 25 Jan 2023 10:26:15 -0800
Subject: Re: A mapcount riddle
To: Peter Xu
Cc: Mike Kravetz, linux-mm@kvack.org, Naoya Horiguchi, David Rientjes,
    Michal Hocko, Matthew Wilcox, David Hildenbrand, James Houghton,
    Muchun Song

On Wed, Jan 25, 2023 at 8:02 AM Peter Xu wrote:
>
> On Tue, Jan 24, 2023 at 03:29:53PM -0800, Yang Shi wrote:
> > On Tue, Jan 24, 2023 at 3:00 PM Peter Xu wrote:
> > >
> > > On Tue, Jan 24, 2023 at 12:56:24PM -0800, Mike Kravetz wrote:
> > > > Q: How can a page be mapped into multiple processes and have a
> > > > mapcount of 1?
> > > >
> > > > A: It is a hugetlb page referenced by a shared PMD.
> > > >
> > > > I was looking to expose some basic information about PMD sharing via
> > > > /proc/smaps.  After adding the code, I started a couple of processes
> > > > sharing a large hugetlb mapping that would result in the use of
> > > > shared PMDs.  When I looked at the output of /proc/smaps, I saw my
> > > > new metric counting the number of shared PMDs.  However, what stood
> > > > out was that the entire mapping was listed as Private_Hugetlb.
> > > > WTH???  It certainly was shared!  The routine smaps_hugetlb_range
> > > > decides between Private_Hugetlb and Shared_Hugetlb with this code:
> > > >
> > > >	if (page) {
> > > >		int mapcount = page_mapcount(page);
> > > >
> > > >		if (mapcount >= 2)
> > > >			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > >		else
> > > >			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > >	}
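As a concrete illustration of the setup described above, here is a minimal
userspace sketch (not from the thread; the hugetlbfs path, the fixed 1GB
size, and the hint addresses are assumptions).  On x86-64, PMD sharing only
applies when the shared range covers and is aligned to a full 1GB PUD, so
512 2MB huge pages need to be reserved:

/*
 * A sketch: two processes map the same hugetlbfs file so the kernel can
 * share the PMD page; smaps then reports the region as Private_Hugetlb
 * in both processes, because the head page's mapcount stays at 1.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN	(1UL << 30)	/* 1GB: one PUD's worth of 2MB PMDs */

int main(void)
{
	int fd = open("/dev/hugepages/riddle", O_CREAT | O_RDWR, 0600);
	pid_t child;
	char *p;

	if (fd < 0 || ftruncate(fd, LEN) < 0) {
		perror("hugetlbfs file");
		return 1;
	}

	/*
	 * The hint addresses are assumptions: they only need to be 1GB
	 * aligned so that the vma is considered shareable.
	 */
	p = mmap((void *)(64UL << 30), LEN, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, LEN);		/* fault all the huge pages in */

	child = fork();
	if (child == 0) {		/* second process, same file */
		char *q = mmap((void *)(128UL << 30), LEN,
			       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

		if (q != MAP_FAILED)
			(void)*(volatile char *)q; /* fault: may reuse the shared PMDs */
		pause();		/* keep the mapping alive */
	}
	sleep(1);
	/* Expected (buggy) output: the whole 1GB under Private_Hugetlb. */
	system("grep -i hugetlb /proc/self/smaps");
	kill(child, SIGKILL);
	wait(NULL);
	return 0;
}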
> > >
> > > This is definitely unfortunate..
> > >
> > > > After spending some time looking for issues in the page_mapcount
> > > > code, I came to the realization that the mapcount of hugetlb pages
> > > > only referenced by a shared PMD would be 1 no matter how many
> > > > processes had mapped the page.  When a page is first faulted, the
> > > > mapcount is set to 1.  When faulted in other processes, the shared
> > > > PMD is added to the page table of the other processes.  No increase
> > > > of mapcount will occur.
> > > >
> > > > At first thought this seems bad.  However, I believe this has been
> > > > the behavior since hugetlb PMD sharing was introduced in 2006, and I
> > > > am unaware of any reported issues.  I did an audit of code looking
> > > > at mapcount.  In addition to the above issue with smaps, there
> > > > appears to be an issue with 'migrate_pages' where shared pages could
> > > > be migrated without appropriate privilege.
> > > >
> > > >	/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
> > > >	if (flags & (MPOL_MF_MOVE_ALL) ||
> > > >	    (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) {
> > > >		if (isolate_hugetlb(page, qp->pagelist) &&
> > > >		    (flags & MPOL_MF_STRICT))
> > > >			/*
> > > >			 * Failed to isolate page but allow migrating pages
> > > >			 * which have been queued.
> > > >			 */
> > > >			ret = 1;
> > > >	}
> > > >
> > > > I will prepare fixes for both of these.  However, I wanted to ask if
> > > > anyone has ideas about other potential issues with this?
> > >
> > > This reminded me whether things should be checked already before this
> > > happens.  E.g. when trying to share a pmd, does it make sense to check
> > > the vma mempolicy before doing so?
> > >
> > > Then the question is, if pmd sharing only happens between vmas that
> > > share the same memory policy, whether the above mapcount==1 check
> > > would be acceptable even if the page is shared by multiple processes.
> >
> > I don't think so.  One process might change its policy, for example,
> > bind to another node, which would then result in migration of the
> > hugepage due to the incorrect mapcount.  The example code pasted by
> > Mike above actually comes from mbind, if I remember correctly.
>
> Yes, or any page migration.  The above was a purely wild idea: that we
> share pmds based on matching vma attributes first (shared, mapping
> alignments, etc.).  It could also take mempolicy into account, so that
> when migrating one page on the shared pmd, one can make a decision for
> all of them with mapcount==1, because that single mapcount may stand for
> all the sharers of the page as long as they share the same mempolicy.
>
> If the above idea applies, we'll also need to unshare during mbind() when
> the mempolicy of the vma changes for hugetlb in this path, because right
> after the mempolicy changes the vma attributes change, so pmd sharing no
> longer holds.

Makes sense to me.  The vmas with different mempolicies can't be merged,
so they shouldn't share a PMD either.
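To make the migration concern concrete, here is a hedged sketch of the
unprivileged path (building on the mapping from the earlier sketch; the
node number is an assumption about the test machine, and mbind() comes
from libnuma, so link with -lnuma):

#include <numaif.h>	/* mbind(); link with -lnuma */
#include <stdio.h>

/* 'p'/'len' would be the shared hugetlb mapping from the earlier sketch */
static void rebind_to_node1(void *p, unsigned long len)
{
	unsigned long nodemask = 1UL << 1;	/* assumed: node 1 exists */

	/*
	 * MPOL_MF_MOVE without CAP_SYS_NICE is only supposed to migrate
	 * pages mapped by the caller alone, but page_mapcount() == 1 on a
	 * shared-PMD hugetlb page makes every sharer look like a sole
	 * owner, so the pages are queued for migration anyway.
	 */
	if (mbind(p, len, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), MPOL_MF_MOVE) != 0)
		perror("mbind");
}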
> But please also ignore this whole thing - I don't think it'll resolve the
> generic problem of the mapcount issue with pmd sharing no matter what.
> It's just something that came to mind when I read it.
>
> >
> > I'm wondering whether we could use the refcount instead of the mapcount
> > to determine whether a hugetlb page is shared or not, assuming
> > refcounting for hugetlb pages behaves similarly to base pages
> > (incremented when mapped by a new process or pinned).  If it is pinned
> > (for example, by GUP) we can't migrate it either.
>
> I think the refcount has the same issue, because it's not accounted
> either when the pmd page is shared.

Good to know.  That is a little bit counter-intuitive, TBH.

> --
> Peter Xu
>
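As a postscript: the stuck mapcount can also be observed directly rather
than inferred from smaps.  A sketch, assuming root (the kernel hides PFNs
in pagemap from unprivileged readers) and the bit layout documented in
Documentation/admin-guide/mm/pagemap.rst:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * A sketch: look up the mapcount the kernel reports for the page backing
 * vaddr.  Assumes the page is present and a 4kB base page size; for a
 * hugetlb mapping, pass the huge-page-aligned base address so the PFN is
 * the compound head.
 */
static int64_t mapcount_of(const void *vaddr)
{
	int pm = open("/proc/self/pagemap", O_RDONLY);
	int kc = open("/proc/kpagecount", O_RDONLY);
	uint64_t ent, pfn;
	int64_t count = -1;

	if (pm < 0 || kc < 0)
		return -1;
	if (pread(pm, &ent, 8, ((uintptr_t)vaddr / 4096) * 8) == 8 &&
	    (ent & (1ULL << 63))) {			/* page present? */
		pfn = ent & ((1ULL << 55) - 1);		/* bits 0-54: PFN */
		pread(kc, &count, 8, pfn * 8);
	}
	close(pm);
	close(kc);
	return count;	/* stays 1 for a hugetlb page mapped via shared PMDs */
}

Per the discussion above, calling this on the base address of the shared
mapping in each sharing process would keep returning 1.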