Date: Tue, 3 Oct 2023 14:39:28 -0400
From: Johannes Weiner
To: Nhat Pham
Cc: Mike Kravetz, akpm@linux-foundation.org, riel@surriel.com, mhocko@kernel.org,
	roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
	tj@kernel.org, lizefan.x@bytedance.com, shuah@kernel.org,
	yosryahmed@google.com, fvdl@google.com, linux-mm@kvack.org,
	kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in memory controller
Message-ID: <20231003183928.GC20979@cmpxchg.org>
References: <20231003001828.2554080-1-nphamcs@gmail.com>
	<20231003001828.2554080-3-nphamcs@gmail.com>
	<20231003171329.GB314430@monkey>
On Tue, Oct 03, 2023 at 11:01:24AM -0700, Nhat Pham wrote:
> On Tue, Oct 3, 2023 at 10:13 AM Mike Kravetz wrote:
> >
> > On 10/02/23 17:18, Nhat Pham wrote:
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index de220e3ff8be..74472e911b0a 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1902,6 +1902,7 @@ void free_huge_folio(struct folio *folio)
> > >  					pages_per_huge_page(h), folio);
> > >  	hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >  					pages_per_huge_page(h), folio);
> > > +	mem_cgroup_uncharge(folio);
> > >  	if (restore_reserve)
> > >  		h->resv_huge_pages++;
> > >
> > > @@ -3009,11 +3010,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	struct hugepage_subpool *spool = subpool_vma(vma);
> > >  	struct hstate *h = hstate_vma(vma);
> > >  	struct folio *folio;
> > > -	long map_chg, map_commit;
> > > +	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
> > >  	long gbl_chg;
> > > -	int ret, idx;
> > > +	int memcg_charge_ret, ret, idx;
> > >  	struct hugetlb_cgroup *h_cg = NULL;
> > > +	struct mem_cgroup *memcg;
> > >  	bool deferred_reserve;
> > > +	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> > > +
> > > +	memcg = get_mem_cgroup_from_current();
> > > +	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
> > > +	if (memcg_charge_ret == -ENOMEM) {
> > > +		mem_cgroup_put(memcg);
> > > +		return ERR_PTR(-ENOMEM);
> > > +	}
> > >
> > >  	idx = hstate_index(h);
> > >  	/*
> > > @@ -3022,8 +3032,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	 * code of zero indicates a reservation exists (no change).
> > >  	 */
> > >  	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
> > > -	if (map_chg < 0)
> > > +	if (map_chg < 0) {
> > > +		if (!memcg_charge_ret)
> > > +			mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +		mem_cgroup_put(memcg);
> > >  		return ERR_PTR(-ENOMEM);
> > > +	}
> > >
> > >  	/*
> > >  	 * Processes that did not create the mapping will have no
> > > @@ -3034,10 +3048,8 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  	 */
> > >  	if (map_chg || avoid_reserve) {
> > >  		gbl_chg = hugepage_subpool_get_pages(spool, 1);
> > > -		if (gbl_chg < 0) {
> > > -			vma_end_reservation(h, vma, addr);
> > > -			return ERR_PTR(-ENOSPC);
> > > -		}
> > > +		if (gbl_chg < 0)
> > > +			goto out_end_reservation;
> > >
> > >  		/*
> > >  		 * Even though there was no reservation in the region/reserve
> > > @@ -3119,6 +3131,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
> > >  					pages_per_huge_page(h), folio);
> > >  	}
> > > +
> > > +	if (!memcg_charge_ret)
> > > +		mem_cgroup_commit_charge(folio, memcg);
> > > +	mem_cgroup_put(memcg);
> > > +
> > >  	return folio;
> > >
> > >  out_uncharge_cgroup:
> > > @@ -3130,7 +3147,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >  out_subpool_put:
> > >  	if (map_chg || avoid_reserve)
> > >  		hugepage_subpool_put_pages(spool, 1);
> > > +out_end_reservation:
> > >  	vma_end_reservation(h, vma, addr);
> > > +	if (!memcg_charge_ret)
> > > +		mem_cgroup_cancel_charge(memcg, nr_pages);
> > > +	mem_cgroup_put(memcg);
> > >  	return ERR_PTR(-ENOSPC);
> > >  }
> > >
> >
> > IIUC, huge page usage is charged in alloc_hugetlb_folio and uncharged in
> > free_huge_folio. During migration, huge pages are allocated via
> > alloc_migrate_hugetlb_folio, not alloc_hugetlb_folio. So, there is no
> > charging for the migration target page and we uncharge the source page.
> > It looks like there will be no charge for the huge page after migration?
> >
> Ah I see! This is a bit subtle indeed.
>
> For the hugetlb controller, it looks like they update the cgroup info
> inside move_hugetlb_state(), which calls hugetlb_cgroup_migrate()
> to transfer the hugetlb cgroup info to the destination folio.
>
> Perhaps we can do something analogous here.
>
> > If my analysis above is correct, then we may need to be careful about
> > this accounting. We may not want both source and target pages to be
> > charged at the same time.
>
> We can create a variant of mem_cgroup_migrate that does not double
> charge, but instead just copy the mem_cgroup information to the new
> folio, and then clear that info from the old folio. That way the memory
> usage counters are untouched.
>
> Somebody with more expertise on migration should fact check me
> of course :)

The only reason mem_cgroup_migrate() double charges right now is
because it's used by replace_page_cache_folio(). In that context, the
isolation of the old page isn't quite as thorough as with migration,
so it cannot transfer and uncharge directly. This goes back a long
time: 0a31bc97c80c3fa87b32c091d9a930ac19cd0c40

If you rename the current implementation to mem_cgroup_replace_page()
for that one caller, you can add a mem_cgroup_migrate() variant which
is charge neutral and clears old->memcg_data. This can be used for
regular and hugetlb page migration.
Something like this (totally untested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4d3282493b6..17ec45bf3653 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7226,29 +7226,14 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	if (mem_cgroup_disabled())
 		return;
 
-	/* Page cache replacement: new folio already charged? */
-	if (folio_memcg(new))
-		return;
-
 	memcg = folio_memcg(old);
 	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
 	if (!memcg)
 		return;
 
-	/* Force-charge the new page. The old one will be freed soon */
-	if (!mem_cgroup_is_root(memcg)) {
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_memsw_account())
-			page_counter_charge(&memcg->memsw, nr_pages);
-	}
-
-	css_get(&memcg->css);
+	/* Transfer the charge and the css ref */
 	commit_charge(new, memcg);
-
-	local_irq_save(flags);
-	mem_cgroup_charge_statistics(memcg, nr_pages);
-	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	old->memcg_data = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
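For reference, this is roughly how mem_cgroup_migrate() would read with
the hunk above applied. It is a sketch only, not part of the posted
patch; the locals and sanity checks that sit above the hunk context are
assumed from the surrounding mm/memcontrol.c and elided here:

void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
	struct mem_cgroup *memcg;

	/* (assumed: the existing folio lock/type sanity checks stay above) */

	if (mem_cgroup_disabled())
		return;

	memcg = folio_memcg(old);
	VM_WARN_ON_ONCE_FOLIO(!memcg, old);
	if (!memcg)
		return;

	/* Transfer the charge and the css ref to the destination folio */
	commit_charge(new, memcg);

	/* The old folio no longer owns the charge; freeing it won't uncharge */
	old->memcg_data = 0;
}

The net effect is charge neutrality: no page_counter or statistics
update happens at migration time, the destination folio simply inherits
the charge and the css reference, and clearing old->memcg_data keeps the
subsequent free of the source folio from uncharging it. Any locals in the
current implementation that only served the removed force-charge path
(e.g. the irq flags) would presumably be dropped along with it.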