From: Muchun Song <songmuchun@bytedance.com>
To: guro@fb.com, hannes@cmpxchg.org, mhocko@kernel.org, akpm@linux-foundation.org, shakeelb@google.com, vdavydov.dev@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, fam.zheng@bytedance.com, bsingharora@gmail.com, shy828301@gmail.com, alex.shi@linux.alibaba.com, Muchun Song <songmuchun@bytedance.com>
Subject: [RFC PATCH v3 07/12] mm: memcontrol: make all the callers of page_memcg() safe
Date: Wed, 21 Apr 2021 15:00:54 +0800
Message-Id: <20210421070059.69361-8-songmuchun@bytedance.com>
In-Reply-To: <20210421070059.69361-1-songmuchun@bytedance.com>
References: <20210421070059.69361-1-songmuchun@bytedance.com>

When we use the objcg APIs to charge LRU pages, a page no longer holds
a reference to the memcg associated with it. So the callers of
page_memcg() must hold an rcu read lock or obtain a reference to the
memcg associated with the page to protect the memcg from being
released.

Introduce get_mem_cgroup_from_page() to obtain a reference to the
memory cgroup associated with a page. In this patch, make all the
callers hold an rcu read lock or obtain such a reference so that the
memcg cannot be released while the LRU pages are reparented.

The callers of page_memcg() within mem_cgroup_move_task() need no
adjustment, because cgroup migration and memory cgroup offlining are
serialized by cgroup_mutex. In that routine, LRU pages cannot be
reparented to their parent memory cgroup, so page_memcg(page) is
stable and the memcg cannot be released.

This is a preparation for reparenting the LRU pages.
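As an illustration of the rule above, the two safe calling patterns
look roughly like the sketch below. The two example functions are
hypothetical and not part of this patch; the APIs they call
(page_memcg(), get_mem_cgroup_from_page(), count_memcg_events(),
css_put()) are the ones this series relies on:

	/* Pattern 1: pin the memcg with the rcu read lock for a short access. */
	static void example_stat_update(struct page *page)
	{
		struct mem_cgroup *memcg;

		rcu_read_lock();
		memcg = page_memcg(page);	/* only stable inside the rcu section */
		if (memcg)
			count_memcg_events(memcg, PGACTIVATE, 1);
		rcu_read_unlock();
	}

	/*
	 * Pattern 2: take a reference when the memcg is used across a
	 * sleeping context or must outlive the rcu section.
	 */
	static void example_long_lived_use(struct page *page)
	{
		struct mem_cgroup *memcg = get_mem_cgroup_from_page(page);

		if (!memcg)
			return;
		/* ... may sleep here; the memcg cannot be released ... */
		css_put(&memcg->css);
	}

Note that get_mem_cgroup_from_page() retries css_tryget() in a loop:
between page_memcg() and css_tryget() the page can be reparented, so a
failed tryget simply means the memcg changed under us and must be
re-read.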
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 fs/buffer.c                |  3 ++-
 fs/fs-writeback.c          | 23 +++++++++++++----------
 include/linux/memcontrol.h | 39 ++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            | 51 +++++++++++++++++++++++++++++++++++++++------------
 mm/migrate.c               |  4 ++++
 mm/page_io.c               |  5 +++--
 6 files changed, 97 insertions(+), 28 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 673cfbef9eec..a542a47f6e27 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -848,7 +848,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 		gfp |= __GFP_NOFAIL;
 
 	/* The page lock pins the memcg */
-	memcg = page_memcg(page);
+	memcg = get_mem_cgroup_from_page(page);
 	old_memcg = set_active_memcg(memcg);
 
 	head = NULL;
@@ -868,6 +868,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 		set_bh_page(bh, page, offset);
 	}
 out:
+	mem_cgroup_put(memcg);
 	set_active_memcg(old_memcg);
 	return head;
 /*
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..3ac002561327 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -255,15 +255,13 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
 	if (inode_cgwb_enabled(inode)) {
 		struct cgroup_subsys_state *memcg_css;
 
-		if (page) {
-			memcg_css = mem_cgroup_css_from_page(page);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-		} else {
-			/* must pin memcg_css, see wb_get_create() */
+		/* must pin memcg_css, see wb_get_create() */
+		if (page)
+			memcg_css = get_mem_cgroup_css_from_page(page);
+		else
 			memcg_css = task_get_css(current, memory_cgrp_id);
-			wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
-			css_put(memcg_css);
-		}
+		wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+		css_put(memcg_css);
 	}
 
 	if (!wb)
@@ -736,16 +734,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 	if (!wbc->wb || wbc->no_cgroup_owner)
 		return;
 
-	css = mem_cgroup_css_from_page(page);
+	css = get_mem_cgroup_css_from_page(page);
 	/* dead cgroups shouldn't contribute to inode ownership arbitration */
 	if (!(css->flags & CSS_ONLINE))
-		return;
+		goto out;
 
 	id = css->id;
 
 	if (id == wbc->wb_id) {
 		wbc->wb_bytes += bytes;
-		return;
+		goto out;
 	}
 
 	if (id == wbc->wb_lcand_id)
@@ -758,6 +756,9 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 		wbc->wb_tcand_bytes += bytes;
 	else
 		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+
+out:
+	css_put(css);
 }
 EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cb0d99583f77..228263f2c82b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -381,7 +381,7 @@ static inline bool PageMemcgKmem(struct page *page);
  * a valid memcg, but can be atomically swapped to the parent memcg.
  *
  * The caller must ensure that the returned memcg won't be released:
- * e.g. acquire the rcu_read_lock or css_set_lock.
+ * e.g. acquire the rcu_read_lock or css_set_lock or cgroup_mutex.
  */
 static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
 {
@@ -459,6 +459,31 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 }
 
 /*
+ * get_mem_cgroup_from_page - Obtain a reference on the memory cgroup associated
+ * with a page
+ * @page: a pointer to the page struct
+ *
+ * Returns a pointer to the memory cgroup (and obtain a reference on it)
+ * associated with the page, or NULL. This function assumes that the page
+ * is known to have a proper memory cgroup pointer. It's not safe to call
+ * this function against some type of pages, e.g. slab pages or ex-slab
+ * pages.
+ */
+static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+retry:
+	memcg = page_memcg(page);
+	if (unlikely(memcg && !css_tryget(&memcg->css)))
+		goto retry;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
+/*
  * page_memcg_rcu - locklessly get the memory cgroup associated with a page
  * @page: a pointer to the page struct
  *
@@ -871,7 +896,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 	return match;
 }
 
-struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_page(struct page *page);
 ino_t page_cgroup_ino(struct page *page);
 
 static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1031,10 +1056,13 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
 static inline void count_memcg_page_event(struct page *page,
 					  enum vm_event_item idx)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
+	struct mem_cgroup *memcg;
 
+	rcu_read_lock();
+	memcg = page_memcg(page);
 	if (memcg)
 		count_memcg_events(memcg, idx, 1);
+	rcu_read_unlock();
 }
 
 static inline void count_memcg_event_mm(struct mm_struct *mm,
@@ -1108,6 +1136,11 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 	return NULL;
 }
 
+static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+	return NULL;
+}
+
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd8e2c242726..a48403e5999c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -410,7 +410,7 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
 #endif
 
 /**
- * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * get_mem_cgroup_css_from_page - get css of the memcg associated with a page
  * @page: page of interest
  *
  * If memcg is bound to the default hierarchy, css of the memcg associated
@@ -420,13 +420,15 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
  * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
  * is returned.
  */
-struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_page(struct page *page)
 {
 	struct mem_cgroup *memcg;
 
-	memcg = page_memcg(page);
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return &root_mem_cgroup->css;
 
-	if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+	memcg = get_mem_cgroup_from_page(page);
+	if (!memcg)
 		memcg = root_mem_cgroup;
 
 	return &memcg->css;
@@ -1997,7 +1999,9 @@ void lock_page_memcg(struct page *page)
 	 * The RCU lock is held throughout the transaction.  The fast
 	 * path can get away without acquiring the memcg->move_lock
 	 * because page moving starts with an RCU grace period.
-	 */
+	 *
+	 * The RCU lock also protects the memcg from being freed.
+	 */
 	rcu_read_lock();
 
 	if (mem_cgroup_disabled())
@@ -4415,7 +4419,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
 					     struct bdi_writeback *wb)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
+	struct mem_cgroup *memcg;
 	struct memcg_cgwb_frn *frn;
 	u64 now = get_jiffies_64();
 	u64 oldest_at = now;
@@ -4424,6 +4428,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
 
 	trace_track_foreign_dirty(page, wb);
 
+	memcg = get_mem_cgroup_from_page(page);
 	/*
 	 * Pick the slot to use.  If there is already a slot for @wb, keep
 	 * using it.  If not replace the oldest one which isn't being
@@ -4462,6 +4467,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
 		frn->memcg_id = wb->memcg_css->id;
 		frn->at = now;
 	}
+	css_put(&memcg->css);
 }
 
 /* issue foreign writeback flushes for recorded foreign dirtying events */
@@ -5992,6 +5998,14 @@ static void mem_cgroup_move_charge(void)
 	atomic_dec(&mc.from->moving_account);
 }
 
+/*
+ * The cgroup migration and memory cgroup offlining are serialized by
+ * @cgroup_mutex. If we reach here, it means that the LRU pages cannot
+ * be reparented to its parent memory cgroup. So during the whole process
+ * of mem_cgroup_move_task(), page_memcg(page) is stable. So we do not
+ * need to worry about the memcg (returned from page_memcg()) being
+ * released even if we do not hold an rcu read lock.
+ */
 static void mem_cgroup_move_task(void)
 {
 	if (mc.to) {
@@ -6819,7 +6833,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	if (page_memcg(newpage))
 		return;
 
-	memcg = page_memcg(oldpage);
+	memcg = get_mem_cgroup_from_page(oldpage);
 	VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
 	if (!memcg)
 		return;
@@ -6840,6 +6854,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
 	mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
 	memcg_check_events(memcg, newpage);
 	local_irq_restore(flags);
+
+	css_put(&memcg->css);
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7028,6 +7044,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return;
 
+	/*
+	 * Interrupts should be disabled by the caller (see the comments below),
+	 * which can serve as RCU read-side critical sections.
+	 */
 	memcg = page_memcg(page);
 
 	VM_WARN_ON_ONCE_PAGE(!memcg, page);
@@ -7095,15 +7115,16 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return 0;
 
+	rcu_read_lock();
 	memcg = page_memcg(page);
 
 	VM_WARN_ON_ONCE_PAGE(!memcg, page);
 	if (!memcg)
-		return 0;
+		goto out;
 
 	if (!entry.val) {
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
-		return 0;
+		goto out;
 	}
 
 	memcg = mem_cgroup_id_get_online(memcg);
@@ -7113,6 +7134,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 		memcg_memory_event(memcg, MEMCG_SWAP_MAX);
 		memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
 		mem_cgroup_id_put(memcg);
+		rcu_read_unlock();
 		return -ENOMEM;
 	}
 
@@ -7122,6 +7144,8 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
 	VM_BUG_ON_PAGE(oldid, page);
 	mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
+out:
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -7176,17 +7200,22 @@ bool mem_cgroup_swap_full(struct page *page)
 	if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	rcu_read_lock();
 	memcg = page_memcg(page);
 	if (!memcg)
-		return false;
+		goto out;
 
 	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
 		unsigned long usage = page_counter_read(&memcg->swap);
 
 		if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
-		    usage * 2 >= READ_ONCE(memcg->swap.max))
+		    usage * 2 >= READ_ONCE(memcg->swap.max)) {
+			rcu_read_unlock();
 			return true;
+		}
 	}
+out:
+	rcu_read_unlock();
 
 	return false;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..9256693a9979 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -463,6 +463,10 @@ int migrate_page_move_mapping(struct address_space *mapping,
 		struct lruvec *old_lruvec, *new_lruvec;
 		struct mem_cgroup *memcg;
 
+		/*
+		 * Irq is disabled, which can serve as RCU read-side critical
+		 * sections.
+		 */
 		memcg = page_memcg(page);
 		old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
 		new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
diff --git a/mm/page_io.c b/mm/page_io.c
index c493ce9ebcf5..81744777ab76 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -269,13 +269,14 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg;
 
+	rcu_read_lock();
 	memcg = page_memcg(page);
 	if (!memcg)
-		return;
+		goto out;
 
-	rcu_read_lock();
 	css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
 	bio_associate_blkg_from_css(bio, css);
+out:
 	rcu_read_unlock();
 }
 #else
-- 
2.11.0