On Fri, 25 Jul 2025, Baolin Wang wrote:
> On 2025/7/25 12:47, Hugh Dickins wrote:
> > On Fri, 25 Jul 2025, Baolin Wang wrote:
> >>>
> >>> I hope to correct the logic of the i915 driver's shmem allocation, by
> >>> extending the shmem write length in the i915 driver to allocate
> >>> PMD-sized THPs. IIUC, some sample fix code is as follows (untested).
> >>> Patryk, could you help test it to see if this resolves your issue?
> >>> Thanks.
> > 
> > This patch cannot be the right fix.  It may be a very sensible workaround
> > for some in-kernel drivers (I've not looked or tried); but unless I
> > misunderstand, it does nothing to restore userspace behaviour on a
> > huge=always tmpfs.
> 
> Yes. Initially, we wanted to maintain compatibility with the 'huge=' option,
> meaning that 'huge=always' tmpfs mount would still allocate PMD-sized THPs.
> However, the current implementation is the consensus we reached after much
> debate:
> 
> 1. “When using tmpfs as a filesystem, it should behave like other filesystems.
> No more special mount options.” Per Matthew.

That's okay, I've not proposed a new mount option at all (though that is
rather how "never" came to end up meaning "not usually": our shared dislike
for adding yet more options).  I'm proposing (shock horror) respecting the
long-standing meaning of "huge=always".

> 2. “Do not let the 'huge=' mount option mean 'PMD-sized' when other sizes
> exist.” Per David.

That's less obvious.  The collision in tmpfs between anon mTHP, file large
folio, and huge mount option (where shmem_enabled in sysfs provides that
mount option for the internal mounts) is certainly difficult to resolve
in any way pleasing to all (or any) of us.

But what remains clear is that we should not degrade the behaviour of
"huge=always" for existing users: they were given PMD-sized when possible
before, and they should be given PMD-sized when possible now (not suited
to all usages, when "huge=within_size" may be more suitable).

> 
> At the time, we should have sought your advice, but we failed. The long
> historical discussion is in this thread[1]. So now the strategy for tmpfs
> supporting large folios is:

Yes, it's a pity how limited and unresponsive I am, then and now and forever;
but the principle of not regressing userspace is not a topic on which my
special input should be needed.

> 
> "
> Considering that tmpfs already has the 'huge=' option to control the PMD-sized
> large folios allocation, we can extend the 'huge=' option to allow any sized
> large folios. The semantics of the 'huge=' mount option are:
> huge=never: no large folios of any size
> huge=always: any sized large folios
> huge=within_size: like 'always' but respect i_size
> huge=advise: like 'always' if requested with madvise()
> 
> Note: For tmpfs mmap() faults, due to the lack of a write size hint, still
> allocate the PMD-sized large folios if huge=always/within_size/advise is set.
> 
> Moreover, the 'deny' and 'force' testing options controlled by
> '/sys/kernel/mm/transparent_hugepage/shmem_enabled' still retain the same
> semantics. The 'deny' can disable any sized large folios for tmpfs, while the
> 'force' can enable PMD sized large folios for tmpfs.
> "

Thanks for the summary, I'll have to come back to it another time: on
first reading, it is not incompatible with "huge=always" always trying
for PMD-sized, but falling back to smaller large folios when unsuccessful.
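
To make that fallback order concrete, here is a minimal userspace sketch
(not kernel code, and not taken from any patch in this thread). Every name
in it is hypothetical, and it assumes 4KiB pages with a PMD order of 9; the
"allocator" is just a bitmask saying which orders happen to be available.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption for illustration only: 4KiB pages, 2MiB PMD, so order 9. */
#define SKETCH_PMD_ORDER 9

/*
 * Hypothetical stand-in for a folio allocation attempt: it "succeeds"
 * only when the bit for that order is set in avail_mask.
 */
static bool sketch_try_alloc(unsigned int avail_mask, int order)
{
	assert(order >= 0 && order < 32);
	return (avail_mask & (1u << order)) != 0;
}

/*
 * Model of "huge=always": try PMD order first, then each smaller large
 * order in turn, and finally settle for a single page (order 0).
 * Returns the order that was chosen.
 */
static int sketch_huge_always_order(unsigned int avail_mask)
{
	for (int order = SKETCH_PMD_ORDER; order > 0; order--) {
		if (sketch_try_alloc(avail_mask, order))
			return order;
	}
	return 0;	/* this sketch treats order 0 as always available */
}
```

With orders 9 and 4 both available the sketch picks 9; with only 4 and 2
available it picks 4; with nothing large available it ends at order 0.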

(I'll mention in passing that I find it strange the way shmem is getting
large folios of a selected subset of sizes from one direction, but large
folios of all possible sizes from another direction - often dependent
on whether i_nlink is 0 at the time, but maybe not.  My own preference,
so long as those tunings exist, is that shmem should always be restricted
to the selected subset of sizes; but I may well alienate everyone I've
not already annoyed with that opinion, and it's probably "not a hill I'm
prepared to die on", nor even directly relevant here - except that I'd
better mention that unhappiness while I'm in the area.)

> 
> Currently, we have observed regression in the i915 driver but have not yet
> seen userspace regression on a huge=always tmpfs.

I shall not object to a temporary workaround to suit the i915 driver; but
I insist it not be taken as an excuse not to fix the userspace regression
later.

> 
> If you have better suggestions, please feel free to point them out. Thanks.

Sounds like you're disinclined to fix it yourself, and I'll lose the
argument if it's not fixed during this cycle (since 6.17-next will become
6.18 LTS); so I'd better carve out the time to get into it in coming weeks.

Hugh

> 
> [1] https://lore.kernel.org/lkml/Zw_IT136rxW_KuhU@casper.infradead.org/
> 
> > Please reread my comment earlier in the thread, in particular,
> > Passing a new SIGBUS xfstest does not excuse a regression: strict PAGE_SIZE
> > SIGBUS behaviour is fine for the newly-featured mTHPs or large folios,
> > but not for the long-established huge=always.