Date: Tue, 30 Mar 2021 11:02:07 -0700
From: Roman Gushchin
To: Zi Yan
CC: Matthew Wilcox, "Kirill A. Shutemov", Andrew Morton, Yang Shi,
 Michal Hocko, John Hubbard, Ralph Campbell, David Nellans,
 Jason Gunthorpe, David Rientjes, Vlastimil Babka, David Hildenbrand,
 Mike Kravetz, Song Liu
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
References: <20210224223536.803765-1-zi.yan@sent.com>
 <890DE8FE-DAF6-49A2-8C62-40B6FD593B4A@nvidia.com>
 <06D1034A-DE8B-4970-9056-6CA1C436D2E8@nvidia.com>
In-Reply-To: <06D1034A-DE8B-4970-9056-6CA1C436D2E8@nvidia.com>

On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> Hi Roman,
>
>
> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
>
> > On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
> >> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
> >>
> >>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
> >>>> From: Zi Yan
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> >>>> and the code is available at
> >>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> >>>> if you want to give it a try. The actual 49 patches are not sent out with this
> >>>> cover letter. :)
> >>>>
> >>>> Instead of asking for code review, I would like to discuss on the concerns I got
> >>>> from previous RFCs. I think there are two major ones:
> >>>>
> >>>> 1. 1GB page allocation. Current implementation allocates 1GB pages from CMA
> >>>>    regions that are reserved at boot time like hugetlbfs.
> >>>>    The concerns on using CMA is that an educated guess is needed to avoid
> >>>>    depleting kernel memory in case CMA regions are set too large. Recently
> >>>>    David Rientjes proposes to use process_madvise() for hugepage collapse,
> >>>>    which is an alternative [1] but might not work for 1GB pages, since there
> >>>>    is no way of _allocating_ a 1GB page to which collapse pages. I proposed a
> >>>>    similar approach at LSF/MM 2019, generating physically contiguous memory
> >>>>    after pages are allocated [2], which is usable for 1GB THPs. This approach
> >>>>    does in-place huge page promotion thus does not require page allocation.
> >>>
> >>> Well, I don't think there an alternative to cma as now. When the memory is almost
> >>> filled at least once, any subsequent activity leading to substantial slab allocations
> >>> (e.g. run git gc) will fragment the memory, so that there are virtually no chances
> >>> to find a continuous GB.
> >>>
> >>> It's possible in theory to reduce the fragmentation on 1GB scale by grouping
> >>> non-movable pageblocks, but it seems a separate project.
> >>
> >> My experiments showed that finding continuous GBs is possible, but I agree that
> >> CMA is more reliable and 1GB scale defragmentation should be a separate project.
> >
> > I actually ran a large scale experiment (on tens of thousands of machines) in the last
> > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
>
> Thanks for the information. I finally have time to come back to this. Do you mind sharing
> the total memory of these machines? I want to have some idea on the scale of this issue to
> make sure I reproduce in a proper machine. Are you trying to get <20% of 10s GBs, 100s GBs,
> or TBs memory?

There are different configurations, but in general they are in the 100s of GB
or smaller.

> >
> > My goal as to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > Without cma chances are reaching 0% very fast after reboot, and even manual manipulations
> > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
>
> Is there a way of replicating such an environment with publicly available software?
> I really want to understand the root cause and am willing to find a possible solution.
> It would be much easier if I can reproduce this locally.

There is nothing fb-specific: once the memory is filled with anon/pagecache,
any subsequent allocations of non-movable memory (slabs, percpu, etc.) will
fragment the memory. There is a pageblock mechanism which prevents
fragmentation on the 2MB scale, but nothing prevents fragmentation on the
1GB scale. It's just a matter of runtime (and the number of mm operations).

> >
> > Even with cma we had to fix a number of additional problems (like sub-optimal placement
> > of cma areas, 2MB THP migration, some ext4 and btrfs page migration issues) to have
> > a reasonable success rate about ~95-99%. And it's not 100% anyway.
> >
> > The problem with artificial tests is that you're likely experimenting on a freshly
> > rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
> > state of a production server.
>
> Yes, I agree that my experiment is not representative. Can you provide more information
> on what application behavior(s) leading to this memory fragmentation? I guess it is
> because non-moveable pages spread across the entire physical memory space. Is there
> a quick reproducer for that?

I have a simple C program which is able to fragment the memory; you can play
with it:
https://github.com/rgushchin/fragm

But as I said, basically any load which is actively using the whole memory
will fragment it.

Thanks!
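
A back-of-the-envelope sketch of the 2MB versus 1GB point above. It is not
taken from this thread or from the fragm program. It assumes 2MB pageblocks,
aligned 1GB regions, and (a strong simplification) that pageblocks holding
non-movable memory are scattered uniformly and independently; the function
names are hypothetical. Under those assumptions a 1GB region is usable only
if all 512 of its pageblocks are free of non-movable allocations:

```python
# Illustrative model only: real page placement is not uniformly random.
PAGEBLOCK_MB = 2          # pageblock size assumed for x86_64
REGION_MB = 1024          # one 1GB page
BLOCKS_PER_REGION = REGION_MB // PAGEBLOCK_MB  # 512 pageblocks per 1GB


def clean_region_probability(unmovable_fraction: float) -> float:
    """Chance that one aligned 1GB region holds no non-movable pageblock."""
    if not 0.0 <= unmovable_fraction <= 1.0:
        raise ValueError("unmovable_fraction must be within [0, 1]")
    return (1.0 - unmovable_fraction) ** BLOCKS_PER_REGION


def expected_clean_regions(total_gb: int, unmovable_fraction: float) -> float:
    """Expected number of clean 1GB regions on a machine with total_gb of RAM."""
    return total_gb * clean_region_probability(unmovable_fraction)


if __name__ == "__main__":
    # Hypothetical 256GB machine; the fractions are made-up inputs.
    for fraction in (0.001, 0.01, 0.05):
        regions = expected_clean_regions(256, fraction)
        print(f"{fraction:.1%} non-movable pageblocks -> "
              f"{regions:.2f} clean 1GB regions expected")
```

Under this model even a small share of non-movable pageblocks leaves very
few clean 1GB regions, which is in line with the observation above; treat
the numbers as illustrative, not as measurements.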