From: Mina Almasry
Date: Wed, 10 Nov 2021 14:59:24 -0800
Subject: Re: [PATCH v4] hugetlb: Add hugetlb.*.numa_stat file
To: Shakeel Butt
Cc: Mike Kravetz, Andrew Morton, Shuah Khan, Miaohe Lin, Oscar Salvador,
    Michal Hocko, Muchun Song, David Rientjes, Jue Wang, Yang Yao,
    Joanna Li, Cannon Matthews, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Tue, Nov 9, 2021 at 10:15 PM Shakeel Butt wrote:
>
> On Mon, Nov 8, 2021 at 1:05 PM Mina Almasry wrote:
> >
> > For hugetlb backed jobs/VMs it's critical to understand the numa
> > information for the memory backing these jobs to deliver optimal
> > performance.
> >
> > Currently this technically can be queried from /proc/self/numa_maps, but
> > there are significant issues with that. Namely:
> > 1. Memory can be mapped on unmapped.
>
> or*
>
> > 2. numa_maps are per process and need to be aggregated across all
> > processes in the cgroup. For shared memory this is more involved as
> > the userspace needs to make sure it doesn't double count shared
> > mappings.
> > 3. I believe querying numa_maps needs to hold the mmap_lock which adds
> > to the contention on this lock.
> >
> > For these reasons I propose simply adding hugetlb.*.numa_stat file,
> > which shows the numa information of the cgroup similarly to
> > memory.numa_stat.
> >
> > On cgroup-v2:
> > cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
> > total=2097152 N0=2097152 N1=0
> >
> > On cgroup-v1:
> > cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
> > total=2097152 N0=2097152 N1=0
> > hierarichal_total=2097152 N0=2097152 N1=0
> >
> > This patch was tested manually by allocating hugetlb memory and querying
> > the hugetlb.*.numa_stat file of the cgroup and its parents.
> >
> > Cc: Mike Kravetz
> > Cc: Andrew Morton
> > Cc: Shuah Khan
> > Cc: Miaohe Lin
> > Cc: Oscar Salvador
> > Cc: Michal Hocko
> > Cc: Muchun Song
> > Cc: David Rientjes
> > Cc: Shakeel Butt
> > Cc: Jue Wang
> > Cc: Yang Yao
> > Cc: Joanna Li
> > Cc: Cannon Matthews
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > Signed-off-by: Mina Almasry
> >
> [...]
>
> > diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> > index c137396129db..54ff6ec68ed3 100644
> > --- a/include/linux/hugetlb_cgroup.h
> > +++ b/include/linux/hugetlb_cgroup.h
> > @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
> >         HUGETLB_NR_MEMORY_EVENTS,
> >  };
> >
> > +struct hugetlb_cgroup_per_node {
> > +       /* hugetlb usage in bytes over all hstates. */
> > +       unsigned long usage[HUGE_MAX_HSTATE];
>
> Should we use atomic here? Is this safe for all archs?
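Purely as an illustration of the alternative being asked about here (this
is not what the v4 patch does, and node_usage_add()/node_usage_read() are
hypothetical helper names), an atomic variant of the per-node counters
might look roughly like this:

#include <linux/atomic.h>
#include <linux/hugetlb.h>

/* Illustrative only: tearing-safe per-node counters on all architectures. */
struct hugetlb_cgroup_per_node {
        /* Per-node hugetlb usage, for each hstate. */
        atomic_long_t usage[HUGE_MAX_HSTATE];
};

static inline void node_usage_add(struct hugetlb_cgroup_per_node *nodeinfo,
                                  int idx, long nr_pages)
{
        /* Called from the charge/uncharge paths with +/- nr_pages. */
        atomic_long_add(nr_pages, &nodeinfo->usage[idx]);
}

static inline long node_usage_read(struct hugetlb_cgroup_per_node *nodeinfo,
                                   int idx)
{
        /* Called from the numa_stat read side. */
        return atomic_long_read(&nodeinfo->usage[idx]);
}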
>
> [...]
>
> >
> >  /*
> > @@ -292,7 +321,8 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> >                 return;
> >
> >         __set_hugetlb_cgroup(page, h_cg, rsvd);
> > -       return;
> > +       if (!rsvd && h_cg)
>
> No need to check h_cg.
>
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
> >  }
> >
> >  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> > @@ -331,7 +361,8 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >
> >         if (rsvd)
> >                 css_put(&h_cg->css);
> > -
> > +       else
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
> >         return;
> >  }
> >
> > @@ -421,6 +452,58 @@ enum {
> >         RES_RSVD_FAILCNT,
> >  };
> >
> > +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> > +{
> > +       int nid;
> > +       struct cftype *cft = seq_cft(seq);
> > +       int idx = MEMFILE_IDX(cft->private);
> > +       bool legacy = MEMFILE_ATTR(cft->private);
> > +       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> > +       struct cgroup_subsys_state *css;
> > +       unsigned long usage;
> > +
> > +       if (legacy) {
> > +               /* Add up usage across all nodes for the non-hierarchical total. */
> > +               usage = 0;
> > +               for_each_node_state(nid, N_MEMORY)
> > +                       usage += h_cg->nodeinfo[nid]->usage[idx];
> > +               seq_printf(seq, "total=%lu", usage * PAGE_SIZE);
> > +
> > +               /* Simply print the per-node usage for the non-hierarchical total. */
> > +               for_each_node_state(nid, N_MEMORY)
> > +                       seq_printf(seq, " N%d=%lu", nid,
> > +                                  h_cg->nodeinfo[nid]->usage[idx] * PAGE_SIZE);
> > +               seq_putc(seq, '\n');
> > +       }
> > +
> > +       /*
> > +        * The hierarchical total is pretty much the value recorded by the
> > +        * counter, so use that.
> > +        */
> > +       seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> > +                  page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> > +
> > +       /*
> > +        * For each node, transverse the css tree to obtain the hierarichal
> > +        * node usage.
> > +        */
> > +       for_each_node_state(nid, N_MEMORY) {
> > +               usage = 0;
> > +               rcu_read_lock();
> > +               css_for_each_descendant_pre(css, &h_cg->css) {
>
> This will be slow. Do you really want to traverse the hugetlb-cgroup
> tree that many times? We had similar issues with memcg stats as well
> but got resolved with rstat.
>
> Though I feel like hugetlb-cgroup is not that worried about
> performance. There is no batching in the charging and uncharging
> codepaths.
>

I'm not seeing a way to get this info without paying the cost to
traverse the tree somewhere (either at charge time or at collection
time). I think maybe we can wait and see if there are issues with the
performance and if so we can adopt the memcg solutions or something
else.

Thank you for reviewing. I'm about to upload v5 with fixes.

> > +                       usage += hugetlb_cgroup_from_css(css)
> > +                                        ->nodeinfo[nid]
> > +                                        ->usage[idx];
> > +               }
> > +               rcu_read_unlock();
> > +               seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
> > +       }
> > +
> > +       seq_putc(seq, '\n');
> > +
> > +       return 0;
> > +}
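As a side note for anyone experimenting with the interface, a rough
userspace sketch (not part of the patch; the path is just the cgroup-v2
example from the commit message) for parsing the flat-keyed output could
look like this:

#include <stdio.h>

int main(void)
{
        /* Example path from the commit message; adjust for your cgroup. */
        const char *path = "/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat";
        char key[32];
        unsigned long long bytes;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror("fopen");
                return 1;
        }

        /*
         * The file is flat-keyed: "total=<bytes> N0=<bytes> N1=<bytes> ...".
         * Stop at the end of the first line, so the extra cgroup-v1
         * "hierarichal_total=..." line is ignored.
         */
        while (fscanf(f, "%31[^=]=%llu", key, &bytes) == 2) {
                printf("%s -> %llu bytes\n", key, bytes);
                if (fgetc(f) != ' ')
                        break;
        }

        fclose(f);
        return 0;
}

The same parsing works for the other hstates (e.g. hugetlb.1GB.numa_stat);
only the file name changes.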