From: Yang Shi
Date: Wed, 15 May 2024 15:41:25 -0600
Subject: Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
To: Barry Song <21cnbao@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM

On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 1:49 AM Yang Shi wrote:
> >
> > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 11, 2024 at 9:18 AM Yang Shi wrote:
> > > >
> > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'd like to propose a session about the allocation and reclamation
> > > > > of mTHP. This is related to Yu Zhao's TAO[1] but not the same.
> > > > >
> > > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > > genuine Android devices, utilizing ARM64 CONT-PTE. However, we've
> > > > > encountered challenges:
> > > > >
> > > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > >   prolonged use, obtaining large folios remains uncertain.
> > > > >   As an instance, following a few hours of operation, the
> > > > >   likelihood of successfully allocating large folios on a phone
> > > > >   may decrease to just 2%.
> > > > >
> > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > >   mutual blocking and unpredictable latency during
> > > > >   reclamation/allocation.
> > > >
> > > > I'm also curious how much large folios can improve reclamation
> > > > efficiency. Having large folios is supposed to reduce the scan time
> > > > since there should be fewer folios on LRU. But IIRC I haven't seen too
> > > > much data or benchmark (particularly real life workloads) regarding
> > > > this.
> > >
> > > Hi Yang,
> > >
> > > We lack direct data on this matter, but information from Ryan's
> > > THP_SWPOUT series [1] provides insights as follows:
> > >
> > > | alloc size |                baseline |           + this series |
> > > |            | mm-unstable (~v6.9-rc1) |                         |
> > > |:-----------|------------------------:|------------------------:|
> > > | 4K Page    |                    0.0% |                    1.3% |
> > > | 64K THP    |                  -13.6% |                   46.3% |
> > > | 2M THP     |                   91.4% |                   89.6% |
> > >
> > >
> > > I suspect the -13.6% performance decrease is due to the split
> > > operation. Once the split is eliminated, the patchset observed a
> > > 46.3% increase.
> > > It is presumed that the overhead required to reclaim 64K is reduced
> > > compared to reclaiming 16 * 4K.
> >
> > Thank you. Actually I care about 4k vs 64k vs 256k ...
> >
> > I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
> > swapout optimization then measured the time spent in madvise, I can
> > see the time was reduced by ~23% between 64k vs 4k. Then there is no
> > noticeable reduction between 64k and larger sizes.
>
> If you engage in perf analysis, what observations can you make? I suspect
> that even with larger folios, the function try_to_unmap_one() continues to
> iterate through PTEs individually.

Yes, I think so.

> If we're able to batch the unmapping process for the entire folio, we
> might observe improved performance.

I profiled my benchmark and didn't see try_to_unmap show up as a hot
spot. The time is actually spent in zram I/O. But batching
try_to_unmap() may still show some improvement. Did you do it in your
kernel? It should be worth exploring.

> >
> > Actually I saw such a pattern (performance doesn't scale with page
> > size after 64K) with some real life workload benchmark. I'm going to
> > talk about it in today's LSF/MM.
> >
> > >
> > > However, at present, in actual android devices, we are observing
> > > nearly 100% occurrence of anon_thp_swpout_fallback after the device
> > > has been in operation for several hours[2].
> > >
> > > Hence, it is likely that we will experience regression instead of
> > > improvement due to the absence of measures to mitigate swap
> > > fragmentation.
> > >
> > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >
> > > > >
> > > > > For instance, if you require large folios, the LRU list's tail
> > > > > could be filled with small folios.
> > > > > LRU (LF - large folio, SF - small folio):
> > > > >
> > > > > LF - LF - LF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF
> > > > >
> > > > > You might end up reclaiming many small folios yet still struggle
> > > > > to allocate large folios. Conversely, the inverse scenario can
> > > > > occur when the LRU list's tail is populated with large folios.
> > > > >
> > > > > SF - SF - SF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF
> > > > >
> > > > > In OPPO's products, we allocate dedicated pageblocks solely for
> > > > > large folio allocation, and we've fine-tuned the LRU mechanism to
> > > > > support dual LRU: one for small folios and another for large ones.
> > > > > Dedicated page blocks offer a fundamental guarantee of allocating
> > > > > large folios. Additionally, segregating small and large folios
> > > > > into two LRUs ensures that both can be efficiently reclaimed for
> > > > > their respective users' requests. However, the implementation
> > > > > lacks aesthetic appeal and is primarily tailored for product
> > > > > purposes, so it isn't fully upstreamable.
> > > > >
> > > > > You can obtain the architectural diagram of OPPO's approach from
> > > > > link[2].
> > > > >
> > > > > Therefore, my plan is to:
> > > > >
> > > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > >   encompasses additional optimizations we've made to address swap
> > > > >   fragmentation issues and improve swap performance, such as
> > > > >   dual-zRAM and compression/decompression of large folios [3].
> > > > >
> > > > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > > >   dual-LRU system for mTHP.
> > > > >
> > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6
> > > > >   phones.
> > > > >
> > > > > - Discuss our future direction: are we leaning towards TAO or
> > > > >   dedicated page blocks?
> > > > >   If we opt for page blocks, how do we plan to resolve the LRU
> > > > >   issue?
> > > > >
> > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > > >
>
> Thanks,
> Barry
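
For reference, the mixed-LRU scenario quoted above (three large folios at the head, thirteen small folios at the tail) can be sketched with a toy model. This is only an illustration and not kernel code. It assumes that reclaim always evicts from the tail, that freeing a small folio never helps a large-folio request, and that freeing one large folio satisfies it. The names `reclaim_for`, `LF` and `SF` are made up for this sketch.

```python
from collections import deque

LF, SF = "LF", "SF"  # large folio / small folio markers (hypothetical labels)


def reclaim_for(lru, wanted):
    """Evict folios from the tail (right end) of `lru` until one folio of
    kind `wanted` has been freed.

    Returns (evicted_count, satisfied). Toy assumption: only a freed folio
    of the wanted kind can satisfy the request.
    """
    evicted = 0
    while lru:
        kind = lru.pop()  # tail of the LRU
        evicted += 1
        if kind == wanted:
            return evicted, True
    return evicted, False


# Single mixed LRU, laid out as in the quoted diagram: head on the left.
mixed = deque([LF] * 3 + [SF] * 13)
single = reclaim_for(deque(mixed), LF)

# Dual LRU: the same folios, segregated by size.
large_lru = deque(k for k in mixed if k == LF)
small_lru = deque(k for k in mixed if k == SF)
dual = reclaim_for(large_lru, LF)

print("single LRU (evicted, satisfied):", single)
print("dual LRU   (evicted, satisfied):", dual)
```

In this model the single list has to evict all thirteen small folios before it reaches a large one, while the segregated large-folio list satisfies the request with a single eviction. Real reclaim involves scanning, references, and fragmentation effects that this sketch ignores.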