From: David Hildenbrand
Organization: Red Hat
Date: Tue, 27 Aug 2024 13:46:26 +0200
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting
To: Johannes Weiner, Usama Arif
Cc: Nico Pache, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, Andrew Morton, Matthew Wilcox, Barry Song,
 Ryan Roberts, Baolin Wang, Lance Yang, Peter Xu, Rafael Aquini,
 Andrea Arcangeli, Jonathan Corbet, "Kirill A . Shutemov", Zi Yan
References: <20240729222727.64319-1-npache@redhat.com>
 <72320F9D-9B6A-4ABA-9B18-E59B8382A262@nvidia.com>
 <61411216-d196-42de-aa64-12bd28aef44f@gmail.com>
 <698ea52e-db99-4d21-9984-ad07038d4068@gmail.com>
 <20240827110959.GA438928@cmpxchg.org>
In-Reply-To: <20240827110959.GA438928@cmpxchg.org>

On 27.08.24 13:09, Johannes Weiner wrote:
> On Tue, Aug 27, 2024 at 11:37:14AM +0100, Usama Arif wrote:
>>
>> On 26/08/2024 17:14, Nico Pache wrote:
>>> On Mon, Aug 26, 2024 at 10:47 AM Usama Arif wrote:
>>>>
>>>> On 26/08/2024 11:40, Nico Pache wrote:
>>>>> On Tue, Jul 30, 2024 at 4:37 PM Nico Pache wrote:
>>>>>>
>>>>>> Hi Zi Yan,
>>>>>> On Mon, Jul 29, 2024 at 7:26 PM Zi Yan wrote:
>>>>>>>
>>>>>>> +Kirill
>>>>>>>
>>>>>>> On 29 Jul 2024, at 18:27, Nico Pache wrote:
>>>>>>>
>>>>>>>> We've seen cases where customers switching from RHEL7 to RHEL8 see a
>>>>>>>> significant increase in the memory footprint for the same workloads.
>>>>>>>>
>>>>>>>> Through our investigations we found that a large contributing factor to
>>>>>>>> the increase in RSS was an increase in THP usage.
>>>>>>>
>>>>>>> Was any knob changed from RHEL7 to RHEL8 to cause more THP usage?
>>>>>> IIRC, most of the system tuning is the same. We attributed the
>>>>>> increase in THP usage to a combination of improvements in the kernel,
>>>>>> and improvements in the libraries (better alignments). That allowed
>>>>>> THP allocations to succeed at a higher rate. I can go back and confirm
>>>>>> this tomorrow though.
>>>>>>>
>>>>>>>>
>>>>>>>> For workloads like MySQL, or when using allocators like jemalloc, it is
>>>>>>>> often recommended to set /transparent_hugepages/enabled=never. This is
>>>>>>>> in part due to performance degradations and increased memory waste.
>>>>>>>>
>>>>>>>> This series introduces enabled=defer; this setting acts as a middle
>>>>>>>> ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
>>>>>>>> page fault handler will act normally, making a hugepage if possible. If
>>>>>>>> the allocation is not MADV_HUGEPAGE, then the page fault handler will
>>>>>>>> default to the base size allocation. The caveat is that khugepaged can
>>>>>>>> still operate on pages that are not MADV_HUGEPAGE.
>>>>>>>
>>>>>>> Why? If the user does not explicitly want huge pages, why bother providing huge
>>>>>>> pages? Wouldn't it increase memory footprint?
>>>>>>
>>>>>> So we have "always", which will always try to allocate a THP when it
>>>>>> can. This setting gives good performance in a lot of conditions, but
>>>>>> tends to waste memory. Additionally, applications DON'T need to be
>>>>>> modified to take advantage of THPs.
>>>>>>
>>>>>> We have "madvise", which will only satisfy allocations that are
>>>>>> MADV_HUGEPAGE; this gives you granular control, and a lot of times
>>>>>> these madvises come from libraries. Unlike "always" you DO need to
>>>>>> modify your application if you want to use THPs.
>>>>>>
>>>>>> Then we have "never", which of course, never allocates THPs.
>>>>>>
>>>>>> OK, back to your question: like "madvise", "defer" gives you the
>>>>>> benefits of THPs when you specifically know you want them
>>>>>> (MADV_HUGEPAGE), but also benefits applications that don't specifically
>>>>>> ask for them (or can't be modified to ask for them), like "always"
>>>>>> does.
>>>>>> The applications that don't ask for THPs must wait for khugepaged
>>>>>> to get them (avoid insertions at PF time)-- this curbs a lot of memory
>>>>>> waste, and gives an increased tunability over "always". Another added
>>>>>> benefit is that khugepaged will most likely not operate on short lived
>>>>>> allocations, meaning that only longstanding memory will be collapsed
>>>>>> to THPs.
>>>>>>
>>>>>> The memory waste can be tuned with max_ptes_none... let's say you want
>>>>>> ~90% of your PMD to be full before collapsing into a huge page: simply
>>>>>> set max_ptes_none=64. Or, for no waste, set max_ptes_none=0, requiring all
>>>>>> 512 pages to be present before being collapsed.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This allows for two things... one, applications specifically designed to
>>>>>>>> use hugepages will get them, and two, applications that don't use
>>>>>>>> hugepages can still benefit from them without aggressively inserting
>>>>>>>> THPs at every possible chance. This curbs the memory waste, and defers
>>>>>>>> the use of hugepages to khugepaged. Khugepaged can then scan the memory
>>>>>>>> for eligible collapsing.
>>>>>>>
>>>>>>> khugepaged would replace application memory with huge pages without a specific
>>>>>>> goal. Why not use a user space agent with process_madvise() to collapse
>>>>>>> huge pages? Admin might have more knobs to tweak than khugepaged.
>>>>>>
>>>>>> The benefits of "always" are that no userspace agent is needed, and
>>>>>> applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
>>>>>> benefit from THPs. This setting hopes to gain some of the same
>>>>>> benefits without the significant waste of memory and with increased
>>>>>> tunability.
>>>>>>
>>>>>> Future changes I have in the works are to make khugepaged more
>>>>>> "smart". Moving it away from the round robin fashion it currently
>>>>>> operates in, to instead make smart and informed decisions of what
>>>>>> memory to collapse (and potentially split).
>>>>>>
>>>>>> Hopefully that helped explain the motivation for this new setting!
>>>>>
>>>>> Any last comments before I resend this?
>>>>>
>>>>> I've been made aware of
>>>>> https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
>>>>> which introduces THP splitting. These are both trying to achieve the
>>>>> same thing through different means. Our approach leverages khugepaged
>>>>> to promote pages, while Usama's uses the reclaim path to demote
>>>>> hugepages and shrink the underlying memory.
>>>>>
>>>>> I will leave it up to reviewers to determine which is better; however,
>>>>> we can't have both, as we'd be introducing thrashing conditions.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> Just inserting this here from my cover letter:
>>>>
>>>> Waiting for khugepaged to scan memory and
>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>> Obviously not part of my patchset here, but I have been testing some
>>> changes to khugepaged to make it more aware of what processes are hot.
>>> Ideally then it can make better choices of what to operate on.
>>>> (i.e. you don't know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough memory
>>>> available, it's better for both performance and predictability to have
>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>> running out of memory.
>>>>
>>>> I just went through your patches, and am not sure why we can't have both?
>>> Fair point, we can. I've been playing around with splitting hugepages
>>> via khugepaged and was thinking of the thrashing conditions there--
>>> but your implementation takes a different approach.
>>> I've been working on performance testing my "defer" changes; once I
>>> find the appropriate workloads I'll try adding your changes to the
>>> mix.
>>> I have a feeling my approach is better for latency sensitive
>>> workloads, while yours is better for throughput, but let me find a way
>>> to confirm that.
>>>
>>>
>> Hmm, I am not sure if it's latency vs throughput.
>>
>> There are 2 things we probably want to consider, short lived and long lived mappings, and
>> in each of these situations, having enough memory and running out of memory.
>>
>> For short lived mappings, I believe reducing page faults is a bigger factor in
>> improving performance. In that case, khugepaged won't have enough time to work,
>> so THP=always will perform better than THP=defer. THP=defer in this case will perform
>> the same as THP=madvise?
>> If there is enough memory, then the changes I introduced in the shrinker won't cost anything
>> as the shrinker won't run, and the system performance will be the same as THP=always.
>> If there is low memory and the shrinker runs, it will only split THPs that have more zero-filled
>> pages than max_ptes_none, and map the zero-filled pages to shared zero-pages saving memory.
>> There is of course a cost to splitting and running the shrinker, but hopefully it only splits
>> underused THPs.
>>
>> For long lived mappings, reduced TLB misses would be the bigger factor in improving performance.
>> For the initial run of the application THP=always will perform better wrt TLB misses as
>> page fault handler will give THPs from start.
>> Later on in the run, the memory might look similar between THP=always with shrinker and
>> max_ptes_none < HPAGE_PMD_NR vs THP=defer and max_ptes_none < HPAGE_PMD_NR?
>> This is because khugepaged will have collapsed pages that might have initially been faulted in.
>> And collapsing has a cost, which would not have been incurred if the THPs were present from fault.
>> If there is low memory, then shrinker would split memory (which has a cost as well) and the system
>> memory would look similar or better than THP=defer, as the shrinker would split THPs that initially
>> might not have been underused, but are underused at time of memory pressure.
>>
>> With THP=always + underused shrinker, the cost (splitting) is incurred only if needed and when it's needed.
>> While with THP=defer the cost (higher page faults, higher TLB misses + khugepaged collapse) is incurred all the time,
>> even if the system might have plenty of memory available and there is no need to take a performance hit.
>
> I agree with this. The defer mode is an improvement over the upstream
> status quo, no doubt. However, both defer mode and the shrinker solve
> the issue of memory waste under pressure, while the shrinker permits
> more desirable behavior when memory is abundant.
>
> So my take is that the shrinker is the way to go, and I don't see a
> bona fide use case for defer mode that the shrinker couldn't cover.

Page fault latency? IOW, zeroing a complete THP, which might be up to
512 MiB on arm64. This is one of the things people bring up, where
FreeBSD is different because it will zero fragments on-demand (but this
also results in more page faults).

On the downside, in the past (before per-VMA locks) we could easily and
repeatedly fail to collapse THPs in busy environments. With per-VMA
locks this might have improved in the meantime.

-- 
Cheers,

David / dhildenb
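
For reference, a rough sketch of the arithmetic behind two numbers in this
thread: the "up to 512 MiB on arm64" THP size (which corresponds to 64 KiB
base pages) and the "~90%" quoted for max_ptes_none=64, which works out to
87.5%. This assumes 8-byte page-table entries and, by default, 512 PTEs per
PMD (4 KiB base pages). The helper names are made up for illustration and are
not kernel APIs.

```python
# Back-of-the-envelope arithmetic only; nothing here talks to the kernel.
# Assumption: page-table entries are 8 bytes, so one page-table page
# maps (base_page_size / 8) base pages.

PTE_SIZE = 8  # bytes per page-table entry (assumed)
MIB = 1024 * 1024


def pmd_thp_bytes(base_page_size: int) -> int:
    """Size in bytes of a PMD-mapped THP for a given base page size."""
    ptes_per_table = base_page_size // PTE_SIZE
    return ptes_per_table * base_page_size


def min_populated_fraction(max_ptes_none: int, ptes_per_pmd: int = 512) -> float:
    """Smallest share of a PMD range that must already be populated
    before khugepaged may collapse it, for a given max_ptes_none."""
    return (ptes_per_pmd - max_ptes_none) / ptes_per_pmd


# Memory that has to be zeroed on a THP fault, per base page size.
thp_sizes = {size: pmd_thp_bytes(size) // MIB for size in (4096, 16384, 65536)}

# Required fill level for a few max_ptes_none settings (4 KiB base pages).
fractions = {n: min_populated_fraction(n) for n in (0, 64, 511)}

for size, mib in thp_sizes.items():
    print(f"{size // 1024:>2} KiB base pages -> {mib} MiB PMD-sized THP")
for n, frac in fractions.items():
    print(f"max_ptes_none={n:<3} -> at least {frac:.1%} of the PTEs populated")
```

The first table is why zeroing at fault time matters far more with 64 KiB
base pages than with 4 KiB ones; the second shows that max_ptes_none=511
lets a single populated page trigger a collapse, while 0 requires all 512.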