Date: Wed, 31 Mar 2021 04:09:35 +0100
From: Matthew Wilcox
To: Roman Gushchin
Cc: Zi Yan, linux-mm@kvack.org,
 Kirill A. Shutemov, Andrew Morton, Yang Shi, Michal Hocko,
 John Hubbard, Ralph Campbell, David Nellans, Jason Gunthorpe,
 David Rientjes, Vlastimil Babka, David Hildenbrand, Mike Kravetz,
 Song Liu
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Message-ID: <20210331030935.GT351017@casper.infradead.org>
References: <20210224223536.803765-1-zi.yan@sent.com>
 <890DE8FE-DAF6-49A2-8C62-40B6FD593B4A@nvidia.com>
 <06D1034A-DE8B-4970-9056-6CA1C436D2E8@nvidia.com>

On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> > On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> > > I actually ran a large-scale experiment (on tens of thousands of
> > > machines) over the last several months.  It was about hugetlbfs
> > > 1GB pages, but the allocation mechanism is the same.
> >
> > Thanks for the information.  I finally have time to come back to
> > this.  Do you mind sharing the total memory of these machines?  I
> > want to get some idea of the scale of this issue to make sure I
> > reproduce it on a suitable machine.  Are you trying to get <20% of
> > 10s of GBs, 100s of GBs, or TBs of memory?
>
> There are different configurations, but in general they are in the
> 100s of GBs or smaller.

Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
machines might be illuminating.

> > > My goal was to allocate a relatively small number of 1GB pages
> > > (<20% of the total memory).  Without CMA the chances of success
> > > reach 0% very quickly after reboot, and even manual interventions
> > > like shutting down all workloads, dropping caches, calling sync,
> > > compaction, etc. do not help much.  Sometimes you can allocate
> > > maybe 1-2 pages, but that's about it.
> >
> > Is there a way of replicating such an environment with publicly
> > available software?  I really want to understand the root cause and
> > am willing to find a possible solution.  It would be much easier if
> > I could reproduce this locally.
>
> There is nothing fb-specific: once the memory is filled with
> anon/pagecache, any subsequent allocations of non-movable memory
> (slabs, percpu, etc.) will fragment the memory.  There is a pageblock
> mechanism which prevents fragmentation at the 2MB scale, but nothing
> prevents fragmentation at the 1GB scale.  It is just a matter of
> runtime (and the number of mm operations).

I think this is somewhere the buddy allocator could be improved.  Of
course, it knows nothing of larger page orders (which needs to be
fixed), but in general, I would like it to do a better job of
segregating movable and unmovable allocations.
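As an aside, to make the /proc/buddyinfo suggestion above concrete:
the rough userspace sketch below (my own throwaway code, nothing in
the tree; it assumes 4K base pages and the usual buddyinfo line
format) sums the per-order free counts and reports how much of the
free memory still sits in pageblock-sized (2MB on x86_64, order 9)
or larger blocks.  On a badly fragmented machine that second number
collapses long before total free memory does.

/* buddyinfo-sum.c: crude fragmentation summary from /proc/buddyinfo.
 * Assumes 4K base pages and the usual line format:
 *   Node 0, zone   Normal  216  55  189 ...  (one count per order)
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];
	unsigned long free_kb = 0, big_kb = 0;

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char zone[32];
		unsigned long count;
		int node, n = 0, order = 0;
		char *p = line;

		if (sscanf(p, "Node %d, zone %31s%n", &node, zone, &n) < 2)
			continue;
		p += n;
		while (sscanf(p, "%lu%n", &count, &n) == 1) {
			/* An order-o block is (4 << o) KB. */
			free_kb += count * (4UL << order);
			if (order >= 9)	/* >= one 2MB pageblock */
				big_kb += count * (4UL << order);
			order++;
			p += n;
		}
	}
	fclose(f);

	printf("free: %lu MB, of which %lu MB in >=2MB blocks\n",
	       free_kb >> 10, big_kb >> 10);
	return 0;
}

Running it shortly after boot and again after a long uptime should
show the high-order decay Roman describes.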
Let's take a machine with 100GB of memory as an example.  Ideally,
unmovable allocations would start at 4GB (assuming everything below
4GB is ZONE_DMA32).  Movable allocations can be placed anywhere in
memory, but should avoid being "near" unmovable allocations; perhaps
they start at 5GB.

When unmovable allocations fill up to 5GB, we should first exert a
bit of pressure to shrink the unmovable allocations (looking at you,
dcache), but eventually we'll need to grow the unmovable allocations
above 5GB.  At that point we should migrate away, say, all the
movable pages between 5GB and 5GB+1MB.  If the new unmovable
allocation was just temporary, we get a reassembled 1MB page back
when it is freed.  If it was permanent, we now have 1MB of memory to
soak up the next few unmovable allocations.

The model I'm thinking of here is that we have a "line" in memory
that divides movable from unmovable allocations.  The line can move
up, but only under significant memory pressure.
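In pseudo-code, the model might look something like this.  To be
clear, every identifier below is invented for illustration; none of
it is existing kernel API:

/* Sketch only -- all symbols are hypothetical. */
#define PAGE_SHIFT	12				/* 4K base pages */
#define LINE_STEP	(1UL << (20 - PAGE_SHIFT))	/* 1MB in pages */

struct page;

/* Hypothetical helpers: allocate from the free memory below a pfn,
 * apply shrinker pressure to unmovable consumers, and migrate all
 * movable pages out of a pfn range (0 on success).
 */
struct page *alloc_below(unsigned long boundary_pfn, unsigned int order);
int shrink_unmovable_consumers(unsigned int order);
int migrate_range(unsigned long start_pfn, unsigned long end_pfn);

/* The "line": unmovable allocations live below it, movable above. */
static unsigned long boundary_pfn;

struct page *alloc_unmovable(unsigned int order)
{
	struct page *page;

	/* Fast path: there is still free memory below the line. */
	page = alloc_below(boundary_pfn, order);
	if (page)
		return page;

	/* Next, pressure the shrinkable unmovable users (dcache...). */
	if (shrink_unmovable_consumers(order) > 0)
		return alloc_below(boundary_pfn, order);

	/* Last resort: evacuate the next 1MB above the line, then
	 * ratchet the line up over it.  If the allocation this serves
	 * is temporary, freeing it reassembles a 1MB block; if it is
	 * permanent, the 1MB soaks up further unmovable allocations.
	 */
	if (migrate_range(boundary_pfn, boundary_pfn + LINE_STEP))
		return NULL;
	boundary_pfn += LINE_STEP;

	return alloc_below(boundary_pfn, order);
}

The obvious policy questions are how hard to shrink before moving the
line, and whether the line is ever allowed to move back down once the
unmovable consumers have shrunk.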