From: Barry Song <21cnbao@gmail.com>
Date: Sat, 13 Jan 2024 11:13:01 +1300
Subject: Re: [RFC PATCH v1] mm/filemap: Allow arch to request folio size for exec memory
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Mark Rutland, "Matthew Wilcox (Oracle)",
 Andrew Morton, David Hildenbrand, John Hubbard,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <654df189-e472-4a75-b2be-6faa8ba18a08@arm.com>
References: <20240111154106.3692206-1-ryan.roberts@arm.com>
 <654df189-e472-4a75-b2be-6faa8ba18a08@arm.com>

On Sat, Jan 13, 2024 at 12:15 AM Ryan Roberts wrote:
>
> On 12/01/2024 10:13, Barry Song wrote:
> > On Fri, Jan 12, 2024 at 4:41 AM Ryan Roberts wrote:
> >>
> >> Change the readahead config so that if it is being requested for an
> >> executable mapping, do a synchronous read of an arch-specified size in a
> >> naturally aligned manner.
> >>
> >> On arm64 if memory is physically contiguous and naturally aligned to the
> >> "contpte" size, we can use contpte mappings, which improves utilization
> >> of the TLB. When paired with the "multi-size THP" changes, this works
> >> well to reduce dTLB pressure. However iTLB pressure is still high due to
> >> executable mappings having a low likelihood of being in the required
> >> folio size and mapping alignment, even when the filesystem supports
> >> readahead into large folios (e.g. XFS).
> >>
> >> The reason for the low likelihood is that the current readahead algorithm
> >> starts with an order-2 folio and increases the folio order by 2 every
> >> time the readahead mark is hit. But most executable memory is faulted in
> >> fairly randomly and so the readahead mark is rarely hit and most
> >> executable folios remain order-2. This is observed empirically and
> >> confirmed from discussion with a gnu linker expert; in general, the
> >> linker does nothing to group temporally accessed text together
> >> spatially. Additionally, with the current read-around approach there are
> >> no alignment guarantees between the file and folio. This is
> >> insufficient for arm64's contpte mapping requirement (order-4 for 4K
> >> base pages).
> >>
> >> So it seems reasonable to special-case the read(ahead) logic for
> >> executable mappings. The trade-off is performance improvement (due to
> >> more efficient storage of the translations in iTLB) vs potential read
> >> amplification (due to reading too much data around the fault which won't
> >> be used), and the latter is independent of base page size. I've chosen
> >> 64K folio size for arm64 which benefits both the 4K and 16K base page
> >> size configs and shouldn't lead to any further read-amplification since
> >> the old read-around path was (usually) reading blocks of 128K (with the
> >> last 32K being async).
> >>
> >> Performance Benchmarking
> >> ------------------------
> >>
> >> The below shows kernel compilation and speedometer javascript benchmarks
> >> on an Ampere Altra arm64 system. (The contpte patch series is applied in
> >> the baseline).
> >>
> >> First, confirmation that this patch causes more memory to be contained
> >> in 64K folios (this is for all file-backed memory so includes
> >> non-executable too):
> >>
> >> | File-backed folios      | Speedometer     | Kernel Compile  |
> >> | by size as percentage   |-----------------|-----------------|
> >> | of all mapped file mem  | before | after  | before | after  |
> >> |=========================|========|========|========|========|
> >> |file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
> >> |file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
> >> |file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
> >> |file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
> >> |file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
> >> |file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
> >> |file-thp-partial         |     0% |     0% |     0% |     0% |
> >> |-------------------------|--------|--------|--------|--------|
> >> |file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |
> >>
> >> The above shows that for both use cases, the amount of file memory
> >> backed by 16K folios reduces and the amount backed by 64K folios
> >> increases significantly. And the amount of memory that is contpte-mapped
> >> significantly increases (last line).
> >>
> >> And this is reflected in performance improvement:
> >>
> >> Kernel Compilation (smaller is faster):
> >> | kernel   | real-time   | kern-time   | user-time   | peak memory   |
> >> |----------|-------------|-------------|-------------|---------------|
> >> | before   |        0.0% |        0.0% |        0.0% |          0.0% |
> >> | after    |       -1.6% |       -2.1% |       -1.7% |          0.0% |
> >>
> >> Speedometer (bigger is faster):
> >> | kernel   | runs_per_min   | peak memory   |
> >> |----------|----------------|---------------|
> >> | before   |           0.0% |          0.0% |
> >> | after    |           1.3% |          1.0% |
> >>
> >> Both benchmarks show a ~1.5% improvement once the patch is applied.
> >>
> >> Alternatives
> >> ------------
> >>
> >> I considered (and rejected for now - but I anticipate this patch will
> >> stimulate discussion around what the best approach is) alternative
> >> approaches:
> >>
> >>   - Expose a global user-controlled knob to set the preferred folio
> >>     size; this would move policy to user space and allow (e.g.) setting
> >>     it to PMD-size for even better iTLB utilization. But this would add
> >>     ABI, and I prefer to start with the simplest approach first. It also
> >>     has the downside that a change wouldn't apply to memory already in
> >>     the page cache that is in active use (e.g. libc) so we don't get the
> >>     same level of utilization as for something that is fixed from boot.
> >>
> >>   - Add a per-vma attribute to allow user space to specify preferred
> >>     folio size for memory faulted from the range. (We've talked about
> >>     such a control in the context of mTHP.) The dynamic loader would
> >>     then be responsible for adding the annotations. Again this feels
> >>     like something that could be added later if value was demonstrated.
> >>
> >>   - Enhance MADV_COLLAPSE to collapse to THP sizes less than PMD-size.
> >>     This would still require dynamic linker involvement, but would
> >>     additionally necessitate a copy, and all memory in the range would
> >>     be synchronously faulted in, adding to application load time. It
> >>     would work for filesystems that don't support large folios though.
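
As a quick illustration of the order arithmetic behind the 64K choice
above, here is a minimal userspace sketch; this ilog2() is just a
stand-in for the kernel helper of the same name:

#include <stdio.h>

/* Stand-in for the kernel's ilog2(); valid for powers of two. */
static int ilog2(unsigned long x)
{
        int n = -1;

        while (x) {
                x >>= 1;
                n++;
        }
        return n;
}

int main(void)
{
        unsigned long sz_64k = 64 * 1024;

        /* 4K base pages: 64K covers 16 pages -> order 4 (contpte-mappable). */
        printf("PAGE_SIZE=4K:  order %d\n", ilog2(sz_64k >> 12));

        /* 16K base pages: 64K covers 4 pages -> order 2 (HPA can coalesce). */
        printf("PAGE_SIZE=16K: order %d\n", ilog2(sz_64k >> 14));

        return 0;
}

Both resulting orders fall within the required [0, PMD_ORDER] range,
which is why the single 64K constant can serve both base page size
configs.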
> >>
> >> Signed-off-by: Ryan Roberts
> >> ---
> >>
> >> Hi all,
> >>
> >> I originally concocted something similar to this, with Matthew's help, as a
> >> quick proof of concept hack. Since then I've tried a few different approaches
> >> but always came back to this as the simplest solution. I expect this will raise
> >> a few eyebrows but given it is providing a real performance win, I hope we can
> >> converge to something that can be upstreamed.
> >>
> >> This depends on my contpte series to actually set the contiguous bit in the page
> >> table.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>  arch/arm64/include/asm/pgtable.h | 12 ++++++++++++
> >>  include/linux/pgtable.h          | 12 ++++++++++++
> >>  mm/filemap.c                     | 19 +++++++++++++++++++
> >>  3 files changed, 43 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index f5bf059291c3..8f8f3f7eb8d8 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1143,6 +1143,18 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
> >>   */
> >>  #define arch_wants_old_prefaulted_pte  cpu_has_hw_af
> >>
> >> +/*
> >> + * Request exec memory is read into pagecache in at least 64K folios. The
> >> + * trade-off here is performance improvement due to storing translations more
> >> + * efficiently in the iTLB vs the potential for read amplification due to reading
> >> + * data from disk that won't be used. The latter is independent of base page
> >> + * size, so we set a page-size independent block size of 64K. This size can be
> >> + * contpte-mapped when 4K base pages are in use (16 pages into 1 iTLB entry),
> >> + * and HPA can coalesce it (4 pages into 1 TLB entry) when 16K base pages are in
> >> + * use.
> >> + */
> >> +#define arch_wants_exec_folio_order(void) ilog2(SZ_64K >> PAGE_SHIFT)
> >> +
> >>  static inline bool pud_sect_supported(void)
> >>  {
> >>         return PAGE_SIZE == SZ_4K;
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 170925379534..57090616d09c 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -428,6 +428,18 @@ static inline bool arch_has_hw_pte_young(void)
> >>  }
> >>  #endif
> >>
> >> +#ifndef arch_wants_exec_folio_order
> >> +/*
> >> + * Returns preferred minimum folio order for executable file-backed memory. Must
> >> + * be in range [0, PMD_ORDER]. Negative value implies that the HW has no
> >> + * preference and mm will not special-case executable memory in the pagecache.
> >> + */
> >> +static inline int arch_wants_exec_folio_order(void)
> >> +{
> >> +       return -1;
> >> +}
> >> +#endif
> >> +
> >>  #ifndef arch_check_zapped_pte
> >>  static inline void arch_check_zapped_pte(struct vm_area_struct *vma,
> >>                                           pte_t pte)
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index 67ba56ecdd32..80a76d755534 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -3115,6 +3115,25 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
> >>  }
> >>  #endif
> >>
> >> +       /*
> >> +        * Allow arch to request a preferred minimum folio order for executable
> >> +        * memory. This can often be beneficial to performance if (e.g.) arm64
> >> +        * can contpte-map the folio. Executable memory rarely benefits from
> >> +        * read-ahead anyway, due to its random access nature.
> >> +        */
> >> +       if (vm_flags & VM_EXEC) {
> >> +               int order = arch_wants_exec_folio_order();
> >> +
> >> +               if (order >= 0) {
> >> +                       fpin = maybe_unlock_mmap_for_io(vmf, fpin);
> >> +                       ra->size = 1UL << order;
> >> +                       ra->async_size = 0;
> >> +                       ractl._index &= ~((unsigned long)ra->size - 1);
> >> +                       page_cache_ra_order(&ractl, ra, order);
> >> +                       return fpin;
> >> +               }
> >> +       }
> >
> > I don't know, but most filesystems don't support large mappings, even iomap.
>
> True, but more are coming. For example ext4 is in the works:
> https://lore.kernel.org/all/20240102123918.799062-1-yi.zhang@huaweicloud.com/

Right, hopefully more filesystems will join.

> > This patch might negatively affect them. I feel we need to check
> > mapping_large_folio_support() at least.
>
> page_cache_ra_order() does this check and falls back to small folios if needed.
> So correctness-wise it all works out. I guess your concern is performance due to
> effectively removing the async readahead aspect? But if that is a problem, then
> it's not just a problem if we are reading small folios, so I don't think the
> proposed check is correct.

My point is that this patch is actually changing two things:

1. the readahead index/size, with async_size = 0;
2. trying to use CONT-PTE for filesystems which support large mappings.

We are getting 2 to help improve performance; but for filesystems without
large mapping support, 1 means losing the existing read-around behaviour:

        /*
         * mmap read-around
         */
        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
        ra->size = ra->ra_pages;
        ra->async_size = ra->ra_pages / 4;
        ractl._index = ra->start;
        page_cache_ra_order(&ractl, ra, 0);

We probably need data to prove this causes no regression; otherwise, it is
safer to make the code have no side effects on other filesystems while we
don't have that data.

> Perhaps an alternative would be to double ra->size and set ra->async_size to
> (ra->size / 2)? That would ensure we always have 64K aligned blocks but would
> give us an async portion so readahead can still happen.

This might be worth trying, as the PMD path does exactly the same thing;
async readahead can decrease the latency of subsequent page faults:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
        /* Use the readahead code, even if readahead is disabled */
        if (vm_flags & VM_HUGEPAGE) {
                fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                ractl._index &= ~((unsigned long)HPAGE_PMD_NR - 1);
                ra->size = HPAGE_PMD_NR;
                /*
                 * Fetch two PMD folios, so we get the chance to actually
                 * readahead, unless we've been told not to.
                 */
                if (!(vm_flags & VM_RAND_READ))
                        ra->size *= 2;
                ra->async_size = HPAGE_PMD_NR;
                page_cache_ra_order(&ractl, ra, HPAGE_PMD_ORDER);
                return fpin;
        }
#endif

> I don't feel very expert with this area of the code so I might be talking
> rubbish - would be great to hear from others.
>
> >
> >> +
> >>         /* If we don't want any read-ahead, don't bother */
> >>         if (vm_flags & VM_RAND_READ)
> >>                 return fpin;
> >> --
> >> 2.25.1
>

BTW, is it also possible that user space also wants to map some data as
a cont-pte hugepage, just like we have a strong VM_HUGEPAGE flag for PMD
THP?

Thanks
Barry
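
P.S. a rough sketch of that doubling alternative, untested and only for
discussion - it mirrors the VM_HUGEPAGE path quoted above and reuses the
names already in the patch:

        if (vm_flags & VM_EXEC) {
                int order = arch_wants_exec_folio_order();

                if (order >= 0) {
                        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
                        /* Keep the start naturally aligned for contpte. */
                        ractl._index &= ~((unsigned long)(1UL << order) - 1);
                        ra->size = 1UL << order;
                        /*
                         * Fetch a second block so an async portion
                         * remains and readahead can still get ahead of
                         * the faulting thread.
                         */
                        ra->size *= 2;
                        ra->async_size = 1UL << order;
                        page_cache_ra_order(&ractl, ra, order);
                        return fpin;
                }
        }

This keeps the 64K alignment guarantee while restoring some asynchrony,
at the cost of reading up to 128K around the first fault.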