From: Vinayak Menon <vinmenon@codeaurora.org>
To: Minchan Kim <minchan@kernel.org>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
Date: Wed, 11 Sep 2019 15:37:23 +0530 [thread overview]
Message-ID: <c7fbc609-0bb0-bffd-8b1f-c2588c89bfd2@codeaurora.org> (raw)
In-Reply-To: <20190910175116.GB39783@google.com>
On 9/10/2019 11:21 PM, Minchan Kim wrote:
> On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
>> Hi Minchan,
>>
>>
>> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>>>> The following race is observed, due to which a process faulting
>>>> on a swap entry finds the page neither in the swapcache nor in swap.
>>>> This causes zram to give a zero-filled page that gets mapped to the
>>>> process, resulting in a user space crash later.
>>>>
>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>
>>>> Pa                                     Pb
>>>>
>>>> fault on VA                            fault on VA
>>>> do_swap_page                           do_swap_page
>>>> lookup_swap_cache fails                lookup_swap_cache fails
>>>>                                        Pb scheduled out
>>>> swapin_readahead (deletes zram entry)
>>>> swap_free (makes swap_count 1)
>>>>                                        Pb scheduled in
>>>>                                        swap_readpage (swap_count == 1)
>>>>                                        Takes SWP_SYNCHRONOUS_IO path
>>>>                                        zram entry absent
>>>>                                        zram gives a zero filled page
>>>>
>>>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>>>> to the order in which the page is added to the swap cache and the swap count
>>>> is decremented in do_swap_page. In the race above, this lets Pb take the
>>>> readahead path and thus pick the proper page from the swapcache.
>>> Thanks for the report, Vinayak.
>>>
>>> It's a zram-specific issue because it deallocates the zram block
>>> unconditionally once the read IO is done. The expectation was that the
>>> dirty page is in the swap cache, but with SWP_SYNCHRONOUS_IO that is no
>>> longer true, so I want to resolve the issue in zram-specific code, not
>>> the general one.
>>
>> Thanks for comments Minchan.
>>
>> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify
>> will make sure that it deletes the entry only if the page is in the swapcache. Even in the
>> current issue case, a valid entry is present in the swapcache at the time of the issue
>> (brought in by Pa). It's just that Pb missed it due to the race and tried to read again from
>> zram. So I am wondering whether this is an issue with zram deleting the entry, or with
>> SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen
>> where the zram entry is deleted unconditionally while some process has yet to reference the
>> slot and the page is not in the swapcache.
>>
>>> An idea in my mind is that swap_slot_free_notify should check the slot
>>> reference counter and, if it's higher than 1, it shouldn't free the
>>> slot yet. What do you think about it?
>> It seems fine to me, except for the fact that it will delay zram entry deletion for shared
>> slots, which can be significant sometimes. Also, should we fix this path, as the issue is
>> with SWP_SYNCHRONOUS_IO missing
> It's always a trade-off between memory and performance, since it could hit
> in the swap cache. If it's a shared page, it's likely to hit the cache next
> time, so we could get a performance benefit.
>
> Actually, swap_slot_free_notify is a layering violation, so I want to
> replace it with a discard hint in the long run and want to go in that direction.
Okay, got it.
>
>> a valid swapcache entry ?
>>
>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> With your approach, what prevent below scenario?
>
> A B
>
> do_swap_page
> SWP_SYNCHRONOUS_IO && __swap_count == 1
As shrink_page_list is picking the page from the LRU and B is trying to read from swap simultaneously,
I assume someone had read the page from swap prior to B, when its swap_count was, say, 2 (for it to be
reclaimable by shrink_page_list now). If so, wouldn't that read itself have deleted the zram entry, with
the read page left in the swapcache and dirty? In that case, with the SWAP_HAS_CACHE check in the patch,
B would take the readahead path, and shrink_page_list would attempt a pageout to zram for the dirty page?
(A sketch of the check B would make follows the quoted scenario below.)
> shrink_page_list
> add_to_swap
> swap_count = 2
>
> ..
> ..
> do_swap_page
> swap_read
> swap_slot_free_notify
> zram's slot will be removed
> page = alloc_page_vma
> swap_readpage <-- read zero
>
>
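To make the reasoning above concrete, the decision B would make with the proposed check is roughly
the below (only a sketch mirroring the diff quoted underneath, not the exact code):

	if (si->flags & SWP_SYNCHRONOUS_IO &&
	    __swap_count(entry) == 1 &&
	    !__swap_has_cache(entry)) {
		/* skip swapcache: private slot with nothing cached yet */
	} else {
		/*
		 * SWAP_HAS_CACHE was already set by the earlier swap-in, so B
		 * falls back to swapin_readahead() and picks up the dirty page
		 * from the swapcache instead of re-reading the freed zram slot.
		 */
	}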
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 063c0c1..a5ca05f 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>> extern sector_t swapdev_block(int, pgoff_t);
>> extern int page_swapcount(struct page *);
>> extern int __swap_count(swp_entry_t entry);
>> +extern bool __swap_has_cache(swp_entry_t entry);
>> extern int __swp_swapcount(swp_entry_t entry);
>> extern int swp_swapcount(swp_entry_t entry);
>> extern struct swap_info_struct *page_swap_info(struct page *);
>> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>> return 0;
>> }
>>
>> +static inline bool __swap_has_cache(swp_entry_t entry)
>> +{
>> + return false;
>> +}
>> +
>> static inline int __swp_swapcount(swp_entry_t entry)
>> {
>> return 0;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..a13511f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> struct swap_info_struct *si = swp_swap_info(entry);
>>
>> if (si->flags & SWP_SYNCHRONOUS_IO &&
>> - __swap_count(entry) == 1) {
>> + __swap_count(entry) == 1 &&
>> + !__swap_has_cache(entry)) {
>> /* skip swapcache */
>> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>> vmf->address);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 80445f4..2a1554a8 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>> return count;
>> }
>>
>> +bool __swap_has_cache(swp_entry_t entry)
>> +{
>> + struct swap_info_struct *si;
>> + pgoff_t offset = swp_offset(entry);
>> + bool has_cache = false;
>> +
>> + si = get_swap_device(entry);
>> + if (si) {
>> + has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
>> + put_swap_device(si);
>> + }
>> + return has_cache;
>> +}
>> +
>> static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>> {
>> int count = 0;
>>
>>
>>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> ---
>>>> mm/memory.c | 21 ++++++++++++++++-----
>>>> 1 file changed, 16 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index e0c232f..22643aa 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> struct page *page = NULL, *swapcache;
>>>> struct mem_cgroup *memcg;
>>>> swp_entry_t entry;
>>>> + struct swap_info_struct *si;
>>>> + bool skip_swapcache = false;
>>>> pte_t pte;
>>>> int locked;
>>>> int exclusive = 0;
>>>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>
>>>>
>>>> delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>>> +
>>>> + /*
>>>> + * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
>>>> + * check is made, another process can populate the swapcache, delete
>>>> + * the swap entry and decrement the swap count. So decide on taking
>>>> + * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>>>> + * race described, the victim process will find a swap_count > 1
>>>> + * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>>>> + */
>>>> + si = swp_swap_info(entry);
>>>> + if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>>>> + skip_swapcache = true;
>>>> +
>>>> page = lookup_swap_cache(entry, vma, vmf->address);
>>>> swapcache = page;
>>>>
>>>> if (!page) {
>>>> - struct swap_info_struct *si = swp_swap_info(entry);
>>>> -
>>>> - if (si->flags & SWP_SYNCHRONOUS_IO &&
>>>> - __swap_count(entry) == 1) {
>>>> - /* skip swapcache */
>>>> + if (skip_swapcache) {
>>>> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>>> vmf->address);
>>>> if (page) {
>>>> --
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>
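Coming back to the swap_slot_free_notify direction above: the below is only a rough, untested sketch
of the reference-count check being discussed, written against the swap_slot_free_notify() helper in
mm/page_io.c (whether that is the right place, and what locking it needs, would need a closer look):

static void swap_slot_free_notify(struct page *page)
{
	struct swap_info_struct *sis;
	struct gendisk *disk;
	swp_entry_t entry;

	if (unlikely(!PageSwapCache(page)))
		return;

	sis = page_swap_info(page);
	if (!(sis->flags & SWP_BLKDEV))
		return;

	entry.val = page_private(page);

	/*
	 * Shared slot: leave the backing zram block in place so the other
	 * process(es) referencing the entry can still read it.
	 */
	if (__swap_count(entry) > 1)
		return;

	disk = sis->bdev->bd_disk;
	if (disk->fops->swap_slot_free_notify) {
		SetPageDirty(page);
		disk->fops->swap_slot_free_notify(sis->bdev, swp_offset(entry));
	}
}

With something like that, a shared slot would stay readable from zram until the last reference swaps
it in, at the memory cost of keeping the compressed copy around a bit longer, as noted above.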
Thread overview: 16+ messages
2019-08-30 12:43 Vinayak Menon
2019-09-02 13:21 ` Michal Hocko
2019-09-03 6:13 ` Vinayak Menon
2019-09-03 11:41 ` Michal Hocko
2019-09-03 12:17 ` Vinayak Menon
2019-09-09 4:05 ` Vinayak Menon
2019-09-09 11:23 ` Michal Hocko
2019-09-09 23:26 ` Minchan Kim
2019-09-10 8:22 ` Vinayak Menon
2019-09-10 17:51 ` Minchan Kim
2019-09-11 10:07 ` Vinayak Menon [this message]
2019-09-12 17:14 ` Minchan Kim
2019-09-13 9:05 ` Vinayak Menon
2019-09-16 20:05 ` Minchan Kim
2019-09-17 5:38 ` Vinayak Menon
2019-09-18 1:12 ` Minchan Kim