From: Barry Song <21cnbao@gmail.com>
Date: Thu, 16 May 2024 10:15:13 +1200
Subject: Re: [LSF/MM/BPF TOPIC] mTHP reliable allocation and reclamation
To: Yang Shi
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM

On Thu, May 16, 2024 at 9:41 AM Yang Shi wrote:
>
> On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, May 16, 2024 at 1:49 AM Yang Shi wrote:
> > >
> > > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 11, 2024 at 9:18 AM Yang Shi wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'd like to propose a session about the allocation and reclamation
> > > > > > of mTHP. This is related to Yu Zhao's TAO[1] but not the same.
> > > > > >
> > > > > > OPPO has implemented mTHP-like large folios across thousands of
> > > > > > genuine Android devices, utilizing ARM64 CONT-PTE. However, we've
> > > > > > encountered challenges:
> > > > > >
> > > > > > - The allocation of mTHP isn't consistently reliable; even after
> > > > > >   prolonged use, obtaining large folios remains uncertain.
> > > > > >   As an instance, following a few hours of operation, the
> > > > > >   likelihood of successfully allocating large folios on a phone
> > > > > >   may decrease to just 2%.
> > > > > >
> > > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > > >   mutual blocking and unpredictable latency during
> > > > > >   reclamation/allocation.
> > > > >
> > > > > I'm also curious how much large folios can improve reclamation
> > > > > efficiency. Having large folios is supposed to reduce the scan time
> > > > > since there should be fewer folios on LRU. But IIRC I haven't seen
> > > > > too much data or benchmark (particularly real life workloads)
> > > > > regarding this.
> > > >
> > > > Hi Yang,
> > > >
> > > > We lack direct data on this matter, but information from Ryan's
> > > > THP_SWPOUT series [1] provides insights as follows:
> > > >
> > > > | alloc size |                baseline |           + this series |
> > > > |            | mm-unstable (~v6.9-rc1) |                         |
> > > > |:-----------|------------------------:|------------------------:|
> > > > | 4K Page    |                    0.0% |                    1.3% |
> > > > | 64K THP    |                  -13.6% |                   46.3% |
> > > > | 2M THP     |                   91.4% |                   89.6% |
> > > >
> > > >
> > > > I suspect the -13.6% performance decrease is due to the split
> > > > operation. Once the split is eliminated, the patchset observed a
> > > > 46.3% increase. It is presumed that the overhead required to reclaim
> > > > 64K is reduced compared to reclaiming 16 * 4K.
> > >
> > > Thank you. Actually I care about 4k vs 64k vs 256k ...
> > >
> > > I did a simple test by calling MADV_PAGEOUT on 4G memory w/ the
> > > swapout optimization then measured the time spent in madvise, I can
> > > see the time was reduced by ~23% between 64k vs 4k. Then there is no
> > > noticeable reduction between 64k and larger sizes.
> >
> > If you engage in perf analysis, what observations can you make? I
> > suspect that even with larger folios, the function try_to_unmap_one()
> > continues to iterate through PTEs individually.
>
> Yes, I think so.
>
> > If we're able to batch the unmapping process for the entire folio, we
> > might observe improved performance.
>
> I did profiling to my benchmark, I didn't see try_to_unmap showed as
> hot spot. The time is actually spent in zram I/O.
>
> But batching try_to_unmap() may show some improvement. Did you do it
> in your kernel? It should be worth exploring.

Not at the moment. However, we've experimented with compressing large
folios at larger granularities, such as 64KiB [1]. This has yielded
significant reductions in CPU utilization and improvements in compression
rates.
You can adjust the granularity through the ZSMALLOC_MULTI_PAGES_ORDER
setting; the default value is 4. Without our patch, zRAM compresses large
folios at 4KiB granularity, iterating over each subpage.

[1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

>
> >
> > >
> > > Actually I saw such a pattern (performance doesn't scale with page
> > > size after 64K) with some real life workload benchmark. I'm going to
> > > talk about it in today's LSF/MM.
> > >
> > > >
> > > > However, at present, in actual android devices, we are observing
> > > > nearly 100% occurrence of anon_thp_swpout_fallback after the device
> > > > has been in operation for several hours[2].
> > > >
> > > > Hence, it is likely that we will experience regression instead of
> > > > improvement due to the absence of measures to mitigate swap
> > > > fragmentation.
> > > >
> > > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > > >
> > > > >
> > > > > >
> > > > > >   For instance, if you require large folios, the LRU list's tail
> > > > > >   could be filled with small folios.
> > > > > >   LRU(LF- large folio, SF- small folio):
> > > > > >
> > > > > >   LF - LF - LF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF
> > > > > >
> > > > > >   You might end up reclaiming many small folios yet still
> > > > > >   struggle to allocate large folios. Conversely, the inverse
> > > > > >   scenario can occur when the LRU list's tail is populated with
> > > > > >   large folios.
> > > > > >
> > > > > >   SF - SF - SF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF
> > > > > >
> > > > > > In OPPO's products, we allocate dedicated pageblocks solely for
> > > > > > large folios allocation, and we've fine-tuned the LRU mechanism
> > > > > > to support dual LRU: one for small folios and another for large
> > > > > > ones. Dedicated page blocks offer a fundamental guarantee of
> > > > > > allocating large folios. Additionally, segregating small and
> > > > > > large folios into two LRUs ensures that both can be efficiently
> > > > > > reclaimed for their respective users' requests. However, while
> > > > > > the implementation may lack aesthetic appeal and is primarily
> > > > > > tailored for product purposes, it isn't fully upstreamable.
> > > > > >
> > > > > > You can obtain the architectural diagram of OPPO's approach from
> > > > > > link[2].
> > > > > >
> > > > > > Therefore, my plan is to present:
> > > > > >
> > > > > > - Introduce the architecture of OPPO's mTHP-like approach, which
> > > > > >   encompasses additional optimizations we've made to address
> > > > > >   swap fragmentation issues and improve swap performance, such
> > > > > >   as dual-zRAM and compression/decompression of large folios [3].
> > > > > >
> > > > > > - Present OPPO's method of utilizing dedicated page blocks and a
> > > > > >   dual-LRU system for mTHP.
> > > > > >
> > > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6
> > > > > >   phones.
> > > > > >
> > > > > > - Discuss our future direction: are we leaning towards TAO or
> > > > > >   dedicated page blocks? If we opt for page blocks, how do we
> > > > > >   plan to resolve the LRU issue?
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/

Thanks
Barry
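
A rough userspace illustration of the granularity point made in this
reply (compressing a large folio as one 64KiB unit rather than as sixteen
independent 4KiB subpages). This is only a sketch under stated
assumptions, not the zRAM/zsmalloc patch: zlib stands in for whatever
compressor zRAM is configured with, the "folio" is synthetic repetitive
data, a 4KiB base page is assumed, and the helper names are made up for
the example. It compares compressed size only and says nothing about CPU
time.

```python
import zlib

BASE_PAGE = 4096            # assumption: 4KiB base page size
ORDER = 4                   # default ZSMALLOC_MULTI_PAGES_ORDER quoted in the mail
CHUNK = BASE_PAGE << ORDER  # 2^4 pages * 4KiB = 64KiB


def compressed_size(data: bytes, granularity: int) -> int:
    """Total size after compressing `data` in independent units of
    `granularity` bytes (hypothetical helper, zlib as a stand-in)."""
    return sum(
        len(zlib.compress(data[off:off + granularity]))
        for off in range(0, len(data), granularity)
    )


def make_folio() -> bytes:
    """Synthetic 64KiB buffer; repetitive content chosen only so that it
    compresses, not to model real anonymous memory."""
    line = b"anon page payload: key=value; counter=0000\n"
    return (line * (CHUNK // len(line) + 1))[:CHUNK]


folio = make_folio()
per_page = compressed_size(folio, BASE_PAGE)   # sixteen separate 4KiB streams
per_folio = compressed_size(folio, CHUNK)      # one 64KiB stream
print(f"4KiB units: {per_page} bytes, 64KiB unit: {per_folio} bytes")
```

On data like this, the single 64KiB stream comes out smaller, because
each independent 4KiB stream pays its own header and has to relearn the
same patterns. How large the gap is on real workloads depends on the
data and the compressor; the numbers in [1] are the ones to trust.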