From: Kundan Kumar <kundanthebest@gmail.com>
Date: Mon, 5 Feb 2024 16:32:01 +0530
Subject: Re: Pages doesn't belong to same large order folio in block IO path
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: linux-mm@kvack.org
In-Reply-To: <7b4bb92f-bd57-49c8-8b95-0e10408914fb@arm.com>

On Monday, February 5, 2024, Ryan Roberts wrote:

> On 05/02/2024 06:33, Kundan Kumar wrote:
> >
> > Hi All,
> >
> > I am using the patch "Multi-size THP for anonymous memory"
> > https://lore.kernel.org/all/20231214160251.3574571-1-ryan.roberts@arm.com/T/#u
>
> Thanks for trying this out!
>
> > I enabled mTHP using the sysfs interface:
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-32kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-128kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-256kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-512kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
> > echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> >
> > I can see this patch allocates multi-order folios for anonymous memory.
> >
> > With large order folios getting allocated, I tried direct block IO using fio:
> >
> > fio -iodepth=1 -rw=write -ioengine=io_uring -direct=1 -bs=16K
> > -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> > -name=io_uring_test
>
> I'm not familiar with fio, but my best guess is that this is an alignment issue.
> mTHP will only allocate a large folio if it can be naturally aligned in virtual
> memory. Assuming you are on a system with 4K base pages, mmap will allocate
> a 16K portion of the VA space aligned to 4K, so there is a 3/4 chance that it
> won't be 16K aligned and then the system will have to allocate small folios for
> it. A quick grep of the manual suggests that -iomem_align=16K should solve this.
>
> If that doesn't solve it, then there are a couple of other (less likely)
> possibilities:
>
> The -iomem option defaults to malloc() when not explicitly provided. Is it
> possible that your malloc implementation is using MADV_NOHUGEPAGE? This would
> prevent the allocation of large folios. This seems unlikely because I would have
> thought that malloc would pass 16K object allocations through to mmap, and then
> this wouldn't apply.
>
> The only other possible reason that springs to mind is that if you have enabled
> all of the possible sizes and you are running on a very memory-constrained
> device, then perhaps the physical memory is so fragmented that it can't allocate
> a large folio. This also feels unlikely, though.
>
> If -iomem_align=16K doesn't solve it on its own, I'd suggest trying with all
> mTHP sizes disabled except for 16K (after a reboot, just to be extra safe), then
> use the -iomem=mmap option, which the manual suggests will use mmap with
> MAP_ANONYMOUS.
>
> > The fio malloced memory is allocated from a multi-order folio in
> > the function alloc_anon_folio().
> > The block I/O path takes the fio-allocated memory and maps it in the kernel in
> > the function iov_iter_extract_user_pages().
> > As the pages are mapped using large folios, I try to see whether the pages
> > belong to the same folio using page_folio(page) in the function
> > __bio_iov_iter_get_pages().
> >
> > To my surprise, I see that the pages belong to different folios.
> >
> > Feb  5 10:34:33 kernel: [244413.315660] 1603 iov_iter_extract_user_pages addr = 5593b252a000
> > Feb  5 10:34:33 kernel: [244413.315680] 1610 iov_iter_extract_user_pages nr_pages = 4
> > Feb  5 10:34:33 kernel: [244413.315700] 1291 __bio_iov_iter_get_pages page = ffffea000d4bb9c0 folio = ffffea000d4bb9c0
> > Feb  5 10:34:33 kernel: [244413.315749] 1291 __bio_iov_iter_get_pages page = ffffea000d796200 folio = ffffea000d796200
> > Feb  5 10:34:33 kernel: [244413.315796] 1291 __bio_iov_iter_get_pages page = ffffea000d796240 folio = ffffea000d796240
> > Feb  5 10:34:33 kernel: [244413.315852] 1291 __bio_iov_iter_get_pages page = ffffea000d7b2b80 folio = ffffea000d7b2b80
> >
> > I repeated the same experiment with fio using HUGE pages:
> > fio -iodepth=1 -iomem=mmaphuge -rw=write -ioengine=io_uring -direct=1
> > -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1
> > -name=io_uring_test
>
> According to the manual, -iomem=mmaphuge uses hugetlb. So that will default
> to 2M and always be naturally aligned in virtual space. So it makes sense that
> you are seeing pages that belong to the same folio here.
>
> > This time, when the memory is mmapped from HUGE pages, I see that the pages belong
> > to the same folio.
> >
> > Feb  5 10:51:50 kernel: [245450.439817] 1603 iov_iter_extract_user_pages addr = 7f66e4c00000
> > Feb  5 10:51:50 kernel: [245450.439825] 1610 iov_iter_extract_user_pages nr_pages = 4
> > Feb  5 10:51:50 kernel: [245450.439834] 1291 __bio_iov_iter_get_pages page = ffffea0005bc8000 folio = ffffea0005bc8000
> > Feb  5 10:51:50 kernel: [245450.439858] 1291 __bio_iov_iter_get_pages page = ffffea0005bc8040 folio = ffffea0005bc8000
> > Feb  5 10:51:50 kernel: [245450.439880] 1291 __bio_iov_iter_get_pages page = ffffea0005bc8080 folio = ffffea0005bc8000
> > Feb  5 10:51:50 kernel: [245450.439903] 1291 __bio_iov_iter_get_pages page = ffffea0005bc80c0 folio = ffffea0005bc8000
> >
> > Please let me know if you have any clue as to why the pages for fio's malloced
> > memory don't belong to the same folio.
>
> Let me know if -iomem_align=16K solves it for you!
>
> Thanks,
> Ryan

Thanks, Ryan, for the help and the good, elaborate reply.

I tried various combinations. The good news is that mmapped, aligned memory allocates a large folio and solves the issue. Let's go through the cases one by one:

==============
Aligned malloc
==============
Alignment alone didn't solve the issue. The command I used:
fio -iodepth=1 -iomem_align=16K -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test
The block IO path still sees separate pages belonging to separate folios.
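A note on the logs in this mail: the "page = ... folio = ..." values come from debug prints added around the page extraction. The actual debug patch is not included here; it is essentially a page_folio() check of the following kind (illustrative sketch only):

```c
/*
 * Illustrative sketch only; the real debug patch is not part of this mail.
 * Print each extracted page together with the folio it belongs to. Two pages
 * are backed by the same large folio exactly when the folio pointers match.
 */
#include <linux/mm.h>      /* page_folio() */
#include <linux/printk.h>  /* pr_info() */

#define dbg_print_page_folio(page)					\
	pr_info("%d %s page = %p folio = %p\n",				\
		__LINE__, __func__, (page), page_folio(page))

/*
 * e.g. invoked for each page in __bio_iov_iter_get_pages() and around
 * iov_iter_extract_user_pages() to produce the log lines below.
 */
```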
Logs:
Feb  5 15:27:32 kernel: [261992.075752] 1603 iov_iter_extract_user_pages addr = 55b2a0542000
Feb  5 15:27:32 kernel: [261992.075762] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb  5 15:27:32 kernel: [261992.075786] 1291 __bio_iov_iter_get_pages page = ffffea000d9461c0 folio = ffffea000d9461c0
Feb  5 15:27:32 kernel: [261992.075812] 1291 __bio_iov_iter_get_pages page = ffffea000d7ef7c0 folio = ffffea000d7ef7c0
Feb  5 15:27:32 kernel: [261992.075836] 1291 __bio_iov_iter_get_pages page = ffffea000d7d30c0 folio = ffffea000d7d30c0
Feb  5 15:27:32 kernel: [261992.075861] 1291 __bio_iov_iter_get_pages page = ffffea000d7f2680 folio = ffffea000d7f2680

==============
Non aligned mmap
==============
Unaligned mmap does somewhat better; we see 3 pages from the same folio:
fio -iodepth=1 -iomem=mmap -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test

Feb  5 15:31:08 kernel: [262208.082789] 1603 iov_iter_extract_user_pages addr = 7f72bc711000
Feb  5 15:31:08 kernel: [262208.082808] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb  5 15:24:31 kernel: [261811.086973] 1291 __bio_iov_iter_get_pages page = ffffea000aed36c0 folio = ffffea000aed36c0
Feb  5 15:24:31 kernel: [261811.087010] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0200 folio = ffffea000d2d0200
Feb  5 15:24:31 kernel: [261811.087044] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0240 folio = ffffea000d2d0200
Feb  5 15:24:31 kernel: [261811.087078] 1291 __bio_iov_iter_get_pages page = ffffea000d2d0280 folio = ffffea000d2d0200

==============
Aligned mmap
==============
mmap combined with alignment ("-iomem_align=16K -iomem=mmap") solves the issue!
Even with all the mTHP sizes enabled, I see a single folio backing all 4 pages:
fio -iodepth=1 -iomem_align=16K -iomem=mmap -rw=write -ioengine=io_uring -direct=1 -hipri -bs=16K -numjobs=1 -size=16k -group_reporting -filename=/dev/nvme0n1 -name=io_uring_test

Feb  5 15:29:36 kernel: [262115.791589] 1603 iov_iter_extract_user_pages addr = 7f5c9087b000
Feb  5 15:29:36 kernel: [262115.791611] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb  5 15:29:36 kernel: [262115.791635] 1291 __bio_iov_iter_get_pages page = ffffea000e0116c0 folio = ffffea000e011600
Feb  5 15:29:36 kernel: [262115.791696] 1291 __bio_iov_iter_get_pages page = ffffea000e011700 folio = ffffea000e011600
Feb  5 15:29:36 kernel: [262115.791755] 1291 __bio_iov_iter_get_pages page = ffffea000e011740 folio = ffffea000e011600
Feb  5 15:29:36 kernel: [262115.791814] 1291 __bio_iov_iter_get_pages page = ffffea000e011780 folio = ffffea000e011600

So it looks like normal malloc, even when aligned, doesn't allocate large order folios. Only if we do an mmap, which sets the flags "OS_MAP_ANON | MAP_PRIVATE", do we get the same folio. I was under the assumption that malloc would internally use mmap with MAP_ANON and that we would get the same folio.
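For reference, "-iomem=mmap -iomem_align=16K" essentially gives fio an anonymous, private mapping whose I/O buffer starts on a 16K boundary, which is the natural-alignment condition mTHP needs before it will back the range with a single 16K folio. A minimal user-space sketch of such an allocation (illustrative only; fio's real allocator differs in the details):

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define FOLIO_SIZE (16 * 1024)   /* 16K mTHP size; 4K base pages assumed */

int main(void)
{
	/* Over-allocate so a 16K-aligned 16K sub-range is guaranteed to exist. */
	size_t len = FOLIO_SIZE + FOLIO_SIZE;
	void *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Round up to the next 16K boundary: mTHP only uses a large folio
	 * when the virtual range is naturally aligned to the folio size.
	 */
	uintptr_t buf = ((uintptr_t)raw + FOLIO_SIZE - 1) &
			~((uintptr_t)FOLIO_SIZE - 1);

	printf("raw mapping = %p, 16K-aligned I/O buffer = %p\n",
	       raw, (void *)buf);
	/* O_DIRECT I/O submitted from (void *)buf can then be backed by a
	 * single 16K folio, subject to the mTHP sysfs settings above. */
	return 0;
}
```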
For just the malloc case:
On another front, I have logs in alloc_anon_folio(). For just the malloc case I see an allocation of 64 pages. "addr = 5654feac0000" is the address malloced by fio (without alignment and without mmap).

Feb  5 15:56:56 kernel: [263756.413095] alloc_anon_folio comm=fio order = 6 folio = ffffea000e044000 addr = 5654feac0000 vma = ffff88814cfc7c20
Feb  5 15:56:56 kernel: [263756.413110] alloc_anon_folio comm=fio folio_nr_pages = 64

64 pages will be 0x40000 bytes; added to 5654feac0000 that gives 5654feb00000. So user space addresses in this range should be covered by this folio itself.

And after this, when IO is issued, I see a user space address in this range passed to the block IO path. But iov_iter_extract_user_pages() doesn't fetch pages of the same folio:

Feb  5 15:56:57 kernel: [263756.678586] 1603 iov_iter_extract_user_pages addr = 5654fead4000
Feb  5 15:56:57 kernel: [263756.678606] 1610 iov_iter_extract_user_pages nr_pages = 4
Feb  5 15:56:57 kernel: [263756.678629] 1291 __bio_iov_iter_get_pages page = ffffea000dfc2b80 folio = ffffea000dfc2b80
Feb  5 15:56:57 kernel: [263756.678684] 1291 __bio_iov_iter_get_pages page = ffffea000dfc2bc0 folio = ffffea000dfc2bc0
Feb  5 15:56:57 kernel: [263756.678738] 1291 __bio_iov_iter_get_pages page = ffffea000d7b9100 folio = ffffea000d7b9100
Feb  5 15:56:57 kernel: [263756.678790] 1291 __bio_iov_iter_get_pages page = ffffea000d7b9140 folio = ffffea000d7b9140
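As a standalone sanity check of the address arithmetic above (illustrative only; 4K base pages assumed): an order-6 folio is 64 pages, i.e. 0x40000 bytes, so the folio starting at 5654feac0000 runs up to 5654feb00000, and the I/O address 5654fead4000 from the last log falls inside that range even though the extracted pages report different folios.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Values taken from the alloc_anon_folio and I/O logs above. */
	uint64_t folio_start = 0x5654feac0000ULL;
	uint64_t folio_bytes = 64 * 4096;          /* order-6 folio = 0x40000 bytes */
	uint64_t folio_end   = folio_start + folio_bytes;
	uint64_t io_addr     = 0x5654fead4000ULL;  /* addr from iov_iter_extract_user_pages */

	printf("folio VA range: [0x%llx, 0x%llx)\n",
	       (unsigned long long)folio_start, (unsigned long long)folio_end);
	printf("I/O addr 0x%llx is %s that range\n",
	       (unsigned long long)io_addr,
	       (io_addr >= folio_start && io_addr < folio_end) ? "inside" : "outside");
	return 0;
}
```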
Please let me know your thoughts on this.

--
Kundan Kumar