From: Andy Lutomirski <luto@amacapital.net>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>,
Nadav Amit <nadav.amit@gmail.com>, Borislav Petkov <bp@alien8.de>,
Kees Cook <keescook@chromium.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: PCID review?
Date: Thu, 9 Feb 2017 18:46:57 -0800 [thread overview]
Message-ID: <CALCETrUiUnZ1AWHjx8-__t0DUwryys9O95GABhhpG9AnHwrg9Q@mail.gmail.com> (raw)
In-Reply-To: <20170209001042.ahxmoqegr6h74mle@techsingularity.net>
On Wed, Feb 8, 2017 at 4:10 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Wed, Feb 08, 2017 at 12:51:24PM -0800, Andy Lutomirski wrote:
>> On Tue, Feb 7, 2017 at 10:56 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > Quite a few people have expressed interest in enabling PCID on (x86)
>> > Linux. Here's the code:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/pcid
>> >
>> > The main hold-up is that the code needs to be reviewed very carefully.
>> > It's quite subtle. In particular, "x86/mm: Try to preserve old TLB
>> > entries using PCID" ought to be looked at carefully to make sure the
>> > locking is right, but there are plenty of other ways this this could
>> > all break.
>> >
>> > Anyone want to take a look or maybe scare up some other reviewers?
>> > (Kees, you seemed *really* excited about getting this in.)
>>
>> Nadav pointed out that this doesn't work right with
>> ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. Mel, here's the issue:
>>
>
> I confess I didn't read the thread or patch, I'm only looking at this
> question.
That's fine with me :)
>
>> I want to add ASID (Intel calls it PCID) support to x86. This means
>> that "flush the TLB on a given CPU" will no longer be a particularly
>> well defined operation because it's not clear which ASID tag to flush.
>> Instead there's "flush the TLB for a given mm on a given CPU".
>>
>
> Understood.
>
>> If I'm understanding the batched flush code, all it's trying to do is
>> to flush more than one mm at a time.
>
> More or less. It would be more accurate to say it's flushing any CPU TLB
> that potentially accessed a list of pages recently. It's not flushing
> "multiple MMs at a time" as such, it's flushing "only CPUs that accessed
> pages mapped by a mm recently". The distinction may not matter.
>
> The requirements do matter though.
>
> mm/vmscan.c is where TTU_BATCH_FLUSH is set and that is processing a list
> of pages that can be mapped by multiple MMs.
>
> The TLBs must be flushed before either IO starts (try_to_unmap_flush_dirty)
> or they are freed to the page allocator (try_to_unmap_flush).
>
> To do this, all it has to track is a simple mask of CPUs, whether a flush
> is necessary and whether any of the PTEs were dirty. This is trivial to
> assemble during an rmap walk as it's a PTE check and a cpumask_or.
>
> try_to_unmap_flush then flushes the entire TLB as the cost of targetted
> a specific page to flush was so high (both maintaining the PFNs and the
> individual flush operations).
I could just maybe make it possible to remotely poke a CPU to record
which mms need flushing, but the possible races there are a bit
terrifying.
>
>> Would it make sense to add a new
>> arch API to flush more than one mm? Presumably it would take a linked
>> list, and the batched flush code would fall back to flushing in pieces
>> if it can't allocate a new linked list node when needed.
>>
>
> Conceptually it's ok but the details are a headache.
>
> The defer code would need to maintain a list of mm's (or ASIDs) that is
> unbounded in size to match the number of IPIs sent as the current code as
> opposed to a simple cpumask. There are SWAP_CLUSTER_MAX pages to consider
> with each page potentially mapped by an arbitrary number of MMs. The same
> mm's could exist on multiple lists for each active kswapd instance and
> direct reclaimer.
>
> As multiple reclaimers are interested in the same mm, that pretty much
> rules out linking them off mm_struct unless the locking would serialise
> the parallel reclaimers and prevent an mm existing on more than one list
> at a time. You could try allowing multiple tasks to share the one list
> (not sure how to find that list quickly) but each entry would have to
> be locked and as each user can flush at any time, multiple reclaimers
> potentially have to block while an IPI is being sent. It's hard to see
> how this could be scaled to match the existing code.
>
> It would be easier to track via an array stored in task_struct but the
> number of MMs is unknown in advance so all you can do is guess a reasonable
> size. It would have to flush if the array files resulting in more IPIs
> than the current code depending on how many MMs map the list of pages.
What if I just allocate a smallish array on the stack and then extend
with kmalloc(GFP_ATOMIC) as needed? An allocation failure would just
force an immediate flush, so there shouldn't be any deadlock risk.
Anyway, I need to rework the arch code to make this work at all.
Currently I'm taking a spinlock per mm when flushing that mm, but that
would mean I need to *sort* the list to flush more than one at a time,
and that just sounds nasty. I can probably get rid of the spinlock.
--Andy
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-02-10 2:47 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-07 18:56 Andy Lutomirski
2017-02-07 19:11 ` Kees Cook
2017-02-07 19:24 ` Thomas Garnier
2017-02-07 19:37 ` Nadav Amit
2017-02-08 16:24 ` Andy Lutomirski
2017-02-07 21:06 ` Paul E. McKenney
2017-02-08 16:25 ` Andy Lutomirski
2017-02-08 16:52 ` Paul E. McKenney
2017-02-08 20:51 ` Andy Lutomirski
2017-02-09 0:10 ` Mel Gorman
2017-02-10 2:46 ` Andy Lutomirski [this message]
2017-02-10 11:01 ` Mel Gorman
2017-02-10 16:44 ` Andy Lutomirski
2017-02-10 21:57 ` Mel Gorman
2017-02-10 22:07 ` Andy Lutomirski
2017-02-10 22:25 ` Borislav Petkov
2017-02-10 22:58 ` Andy Lutomirski
2017-02-13 10:05 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CALCETrUiUnZ1AWHjx8-__t0DUwryys9O95GABhhpG9AnHwrg9Q@mail.gmail.com \
--to=luto@amacapital.net \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=keescook@chromium.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mgorman@techsingularity.net \
--cc=nadav.amit@gmail.com \
--cc=paulmck@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox