Date: Wed, 12 Oct 2022 12:45:06 -0700 (PDT)
From: Hugh Dickins
To: Mike Kravetz
Cc: Albert Huang, Muchun Song, Andi Kleen, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops
In-Reply-To: <20221012081526.73067-1-huangjie.albert@bytedance.com>
Message-ID: <5f7ef6ee-6241-9912-f434-962be53272c@google.com>
References: <20221012081526.73067-1-huangjie.albert@bytedance.com>

On Wed, 12 Oct 2022, Albert Huang wrote:

> From: "huangjie.albert"
> 
> Implement these two functions so that we can set the mempolicy on
> the inode of the hugetlb file.  This ensures that the mempolicy of
> all processes sharing this huge page file is consistent.
> 
> One scenario where huge pages are shared: if we need to limit the
> VM's memory usage to node0, we set qemu's mempolicy to bind to
> node0; but if another process (such as virtiofsd) shares memory
> with the VM and the page fault is triggered by virtiofsd, the
> allocated memory may go to node1, depending on virtiofsd's own
> policy.
> 
> Signed-off-by: huangjie.albert

Aha!  Congratulations for noticing, after all this time.  hugetlbfs
contains various little pieces of code that pretend to be supporting
shared NUMA mempolicy, but in fact there was nothing connecting it up.

It will be for Mike to decide, but personally I oppose adding shared
NUMA mempolicy support to hugetlbfs, after eighteen years.
The thing is, it will change the behaviour of NUMA on hugetlbfs: in
ways that would have been sensible way back then, yes; but surely
those who have invested in NUMA and hugetlbfs have developed other
ways of administering it successfully, without shared NUMA mempolicy.

At the least, I would expect some tests to break (I could easily be
wrong), and there's a chance that some app or tool would break too.

I have carried the reverse of Albert's patch for a long time, stripping
out the pretence of shared NUMA mempolicy support from hugetlbfs: I
wanted that, so that I could work on modifying the tmpfs implementation,
without having to worry about other users.

Mike, if you would prefer to see my patch stripping out the pretence,
let us know: it has never been a priority to send in, but I can update
it to 6.1-rc1 if you'd like to see it.  (Once upon a time, it removed
all need for struct hugetlbfs_inode_info, but nowadays that's still
required for the memfd seals.)

Whether Albert's patch is complete and correct, I haven't begun to
think about: I am not saying it isn't, but shared NUMA mempolicy adds
another dimension of complexity, and need for support, that I think
hugetlbfs would be better off continuing to survive without.

Hugh

> ---
>  mm/hugetlb.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0ad53ad98e74..ed7599821655 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4678,6 +4678,24 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_NUMA
> +int hugetlb_vm_op_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +
> +	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, mpol);
> +}
> +
> +struct mempolicy *hugetlb_vm_op_get_policy(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t index;
> +
> +	index = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	return mpol_shared_policy_lookup(&HUGETLBFS_I(inode)->policy, index);
> +}
> +#endif
> +
>  /*
>   * When a new function is introduced to vm_operations_struct and added
>   * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops.
> @@ -4691,6 +4709,10 @@ const struct vm_operations_struct hugetlb_vm_ops = {
>  	.close = hugetlb_vm_op_close,
>  	.may_split = hugetlb_vm_op_split,
>  	.pagesize = hugetlb_vm_op_pagesize,
> +#ifdef CONFIG_NUMA
> +	.set_policy = hugetlb_vm_op_set_policy,
> +	.get_policy = hugetlb_vm_op_get_policy,
> +#endif
>  };
>  
>  static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
> -- 
> 2.31.1
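
For readers less familiar with the scenario in the commit message, a
minimal userspace sketch of it follows.  It is not part of the patch;
the hugetlbfs path, file name and node numbers are illustrative
assumptions.  The point is only to show where mbind() enters the
picture: today the policy set below stays private to the calling
process's VMA, whereas with .set_policy/.get_policy wired into
hugetlb_vm_ops it would be stored in the shared policy tree on the
hugetlbfs inode and honoured on faults from any process mapping the
same file.

/*
 * Build with:  gcc sketch.c -o sketch -lnuma
 * Assumes a 2MB default huge page size and a hugetlbfs mount at
 * /dev/hugepages; "guest-mem" is a made-up file name.
 */
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one 2MB huge page */
	unsigned long nodemask = 1UL << 0;	/* allow node 0 only */

	int fd = open("/dev/hugepages/guest-mem", O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Bind the mapping to node 0.  Whether a second process that
	 * mmap()s the same file sees this policy when *it* faults the
	 * page in is exactly what .set_policy/.get_policy on
	 * hugetlb_vm_ops would decide.
	 */
	if (mbind(p, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0) < 0)
		perror("mbind");

	p[0] = 1;	/* first touch: fault the huge page in */

	munmap(p, len);
	close(fd);
	return 0;
}

If a virtiofsd-like process ran the same mmap() against the same file
without its own mbind(), its first-touch faults would today follow its
own task policy, which is the node1 surprise the commit message
describes.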