Subject: Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable
From: Vlastimil Babka
To: Michal Hocko, kirill.shutemov@linux.intel.com, Yang Shi
Cc: hannes@cmpxchg.org, rientjes@google.com, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Thu, 22 Aug 2019 14:56:56 +0200
In-Reply-To: <20190822080434.GF12785@dhcp22.suse.cz>
References: <1566410125-66011-1-git-send-email-yang.shi@linux.alibaba.com>
 <20190822080434.GF12785@dhcp22.suse.cz>

On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most important metrics for memory
>> pressure.
> 
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls.
> I do realize that this is a simple metric that is attractive to use and
> works in many cases though.
> 
>> Currently, the deferred split THPs are not accounted into available
>> memory, but they are actually reclaimable, like reclaimable slabs.
>>
>> And, they seem very common with common workloads when THP is enabled.
>> A simple run of the MariaDB test from mmtests with THP set to "always"
>> shows it could generate over fifteen thousand deferred split THPs
>> (accumulating around 30G in a one hour run, 75% of the 40G memory of
>> my VM). It looks worth accounting for in MemAvailable.
> 
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They are essentially blocking any higher order
> allocations and also push the system towards more memory pressure.
> 
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs into
>> the second tail page, and account it into KReclaimable. Although THP
>> allocations are not exactly "kernel allocations", once they are
>> unmapped, they are in fact kernel-only. KReclaimable is already
>> accounted into MemAvailable.
> 
> This sounds reasonable to me.
> 
>> When the deferred split THPs get split due to memory pressure or freed,
>> just decrease by the recorded number.
>>
>> With this change, when running a program which populates a 1G address
>> space and then madvise(MADV_DONTNEED)s 511 pages of every THP,
>> /proc/meminfo shows the deferred split THPs accounted properly.
>>
>> Populated, before calling madvise(MADV_DONTNEED):
>> MemAvailable:   43531960 kB
>> AnonPages:       1096660 kB
>> KReclaimable:      26156 kB
>> AnonHugePages:   1056768 kB
>>
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable:   44411164 kB
>> AnonPages:         50140 kB
>> KReclaimable:    1070640 kB
>> AnonHugePages:     10240 kB
>>
>> Suggested-by: Vlastimil Babka
>> Cc: Michal Hocko
>> Cc: Kirill A. Shutemov
>> Cc: Johannes Weiner
>> Cc: David Rientjes
>> Signed-off-by: Yang Shi

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...
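BTW, for anyone who wants to reproduce the meminfo numbers quoted above,
my reading of the changelog is a program roughly like the sketch below.
It is not taken from the patch series; the MADV_HUGEPAGE hint, the 2MB
THP size and the sleep/pause points are my assumptions:

/*
 * Sketch of the reproducer described in the changelog: populate a 1G
 * anonymous mapping with THPs, then madvise(MADV_DONTNEED) 511 of the
 * 512 base pages of each 2MB huge page, which leaves the THPs on the
 * deferred split queue. Error handling trimmed for brevity.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(1UL << 30)	/* 1G */
#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MB THP */
#define BASE_PAGE	4096UL

int main(void)
{
	unsigned long off;
	char *buf = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, MAP_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, MAP_SIZE);	/* populate; should fault in THPs */
	puts("populated, check /proc/meminfo");
	sleep(30);

	/* zap all but the first base page of each THP -> deferred split */
	for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
		madvise(buf + off + BASE_PAGE, HPAGE_SIZE - BASE_PAGE,
			MADV_DONTNEED);
	puts("madvised, check /proc/meminfo again");
	pause();
	return 0;
}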
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>  
>>  	INIT_LIST_HEAD(page_deferred_list(page));
>>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +	page[2].nr_freeable = 0;
>>  }
>>  
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  		if (!list_empty(page_deferred_list(head))) {
>>  			ds_queue->split_queue_len--;
>>  			list_del(page_deferred_list(head));
>> +			__mod_node_page_state(page_pgdat(page),
>> +					      NR_KERNEL_MISC_RECLAIMABLE,
>> +					      -head[2].nr_freeable);
>> +			head[2].nr_freeable = 0;
>>  		}
>>  		if (mapping)
>>  			__dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(page));
>>  	}
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      -page[2].nr_freeable);
>> +	page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page
to the deferred list? So here the code would go into the
if (!list_empty()) { } part above.

>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }
>>  
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  		return;
>>  
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +	page[2].nr_freeable += nr;
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      nr);

Same here, only do this when adding to the list, below? Otherwise we
might account base pages multiple times.

>>  	if (list_empty(page_deferred_list(page))) {
>>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>  		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>  
>>  	if (nr) {
>>  		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -		deferred_split_huge_page(page);
>> +		deferred_split_huge_page(page, nr);
>>  	}
>>  }
>>  
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>  			clear_page_mlock(page);
>>  
>>  		if (PageTransCompound(page))
>> -			deferred_split_huge_page(compound_head(page));
>> +			deferred_split_huge_page(compound_head(page), 1);
>>  
>>  		/*
>>  		 * It would be tidy to reset the PageAnon mapping here,
>> -- 
>> 1.8.3.1
> 
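To make the second suggestion a bit more concrete, the queueing in
deferred_split_huge_page() could do the accounting only when the page is
actually added to the deferred list, e.g. like this. This is an untested
sketch on top of the patch, with the memcg shrinker notification and the
rest of the function omitted:

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (list_empty(page_deferred_list(page))) {
		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
		ds_queue->split_queue_len++;
		/*
		 * Account the freeable base pages only when the THP is
		 * actually added to the deferred list, so repeated calls
		 * for an already queued page don't inflate the counter.
		 */
		page[2].nr_freeable += nr;
		__mod_node_page_state(page_pgdat(page),
				      NR_KERNEL_MISC_RECLAIMABLE, nr);
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

free_transhuge_page() would then likewise touch nr_freeable only inside
its !list_empty() branch, so the counter is always adjusted together with
the list membership.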