linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jerome Marchand <jmarchan@redhat.com>
To: Mikulas Patocka <mpatocka@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Ondrej Kozina <okozina@redhat.com>
Cc: Stanislav Kozina <skozina@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: System freezes after OOM
Date: Wed, 13 Jul 2016 10:35:01 +0200	[thread overview]
Message-ID: <1e31eea2-beb4-5734-c831-0c1753f0115a@redhat.com> (raw)
In-Reply-To: <alpine.LRH.2.02.1607121907160.24806@file01.intranet.prod.int.rdu2.redhat.com>


[-- Attachment #1.1: Type: text/plain, Size: 8827 bytes --]

On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> The problem of swapping to dm-crypt is this.
> 
> The free memory goes low, kswapd decides that some page should be swapped 
> out. However, when you swap to an ecrypted device, writeback of each page 
> requires another page to hold the encrypted data. dm-crypt uses mempools 
> for all its structures and pages, so that it can make forward progress 
> even if there is no memory free. However, the mempool code first allocates 
> from general memory allocator and resorts to the mempool only if the 
> memory is below limit.
> 
> So every attempt to swap out some page allocates another page.
> 
> As long as swapping is in progress, the free memory is below the limit 
> (because the swapping activity itself consumes any memory over the limit). 
> And that triggered the OOM killer prematurely.

There is a quite recent sysctl vm knob that I believe can help in this
case: watermark_scale_factor. If you increase this value, kswapd will
start paging out earlier, when there might still be enough free memory.

Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?

Jerome

> 
> 
> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
>> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
>> [...]
>>> The general problem is that the memory allocator does 16 retries to 
>>> allocate a page and then triggers the OOM killer (and it doesn't take into 
>>> account how much swap space is free or how many dirty pages were really 
>>> swapped out while it waited).
>>
>> Well, that is not how it works exactly. We retry as long as there is a
>> reclaim progress (at least one page freed) back off only if the
>> reclaimable memory can exceed watermks which is scaled down in 16
>> retries. The overal size of free swap is not really that important if we
>> cannot swap out like here due to complete memory reserves depletion:
>> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
>> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
>> [   90.491283] lowmem_reserve[]: 0 977 977 977
>> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
>>
>> Look at the amount of free memory. It is completely depleted. So it
>> smells like a process which has access to memory reserves has consumed
>> all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
>> context user which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).
> 
> But swapping should proceed even if there is no memory free. There is a 
> comment "TODO: this could cause a theoretical memory reclaim deadlock in 
> the swap out path." in the function add_to_swap - but apart from that, 
> swap should proceed even with no available memory, as long as all the 
> drivers in the block layer use mempools.
> 
>>> So, it could prematurely trigger OOM killer on any slow swapping device 
>>> (including dm-crypt). Michal Hocko reworked the OOM killer in the patch 
>>> 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the flaw that 
>>> it triggers OOM if there is plenty of free swap space free.
>>>
>>> Michal, would you accept a change to the OOM killer, to prevent it from 
>>> triggerring when there is free swap space?
>>
>> No this doesn't sound like a proper solution. The current decision
>> logic, as explained above relies on the feedback from the reclaim. A
>> free swap space doesn't really mean we can make a forward progress.
> 
> I'm interested - why would you need to trigger the OOM killer if there is 
> free swap space?
> 
> The only possibility is that all the memory is filled with unswappable 
> kernel pages - but that condition could be detected if there is unusually 
> low number of anonymous and cache pages. Besides that - in what situation 
> is triggering the OOM killer with free swap desired?
> 
>> -- 
>> Michal Hocko
>> SUSE Labs
>>
> 
> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
> and the following stacktrace was obtained when swapping to dm-crypt.
> 
> We can see that dm-crypt does a mempool allocation. But the mempool 
> allocation somehow falls into throttle_vm_writeout. There, it waits for 
> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
> processing requests at an unusually slow rate of 10 requests per second 
> and it results in the machine being stuck (it would proabably recover if 
> we waited for extreme amount of time).
> 
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> [  345.352536]  [<ffffffff810cc312>] process_one_work+0x242/0x700
> [  345.352536]  [<ffffffff810cc28a>] ? process_one_work+0x1ba/0x700
> [  345.352536]  [<ffffffff810cc81e>] worker_thread+0x4e/0x490
> [  345.352536]  [<ffffffff810cc7d0>] ? process_one_work+0x700/0x700
> [  345.352536]  [<ffffffff810d3c01>] kthread+0x101/0x120
> [  345.352536]  [<ffffffff8110b9f5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
> [  345.352536]  [<ffffffff818db1af>] ret_from_fork+0x1f/0x40
> [  345.352536]  [<ffffffff810d3b00>] ? kthread_create_on_node+0x250/0x250
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

  reply	other threads:[~2016-07-13  8:35 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <57837CEE.1010609@redhat.com>
     [not found] ` <f80dc690-7e71-26b2-59a2-5a1557d26713@redhat.com>
     [not found]   ` <9be09452-de7f-d8be-fd5d-4a80d1cd1ba3@redhat.com>
2016-07-11 15:43     ` Mikulas Patocka
2016-07-12  6:49       ` Michal Hocko
2016-07-12 23:44         ` Mikulas Patocka
2016-07-13  8:35           ` Jerome Marchand [this message]
2016-07-13 11:14             ` Michal Hocko
2016-07-13 14:21               ` Mikulas Patocka
2016-07-13 11:10           ` Michal Hocko
2016-07-13 12:50             ` Michal Hocko
2016-07-13 13:44               ` Milan Broz
2016-07-13 15:21                 ` Mikulas Patocka
2016-07-14  9:09                   ` Michal Hocko
2016-07-14  9:46                     ` Milan Broz
2016-07-13 15:02             ` Mikulas Patocka
2016-07-14 10:51               ` [dm-devel] " Ondrej Kozina
2016-07-14 12:51               ` Michal Hocko
2016-07-14 14:00                 ` Mikulas Patocka
2016-07-14 14:59                   ` Michal Hocko
2016-07-14 15:25                     ` Ondrej Kozina
2016-07-14 17:35                     ` Mikulas Patocka
2016-07-15  8:35                       ` Michal Hocko
2016-07-15 12:11                         ` Mikulas Patocka
2016-07-15 12:22                           ` Michal Hocko
2016-07-15 17:02                             ` Mikulas Patocka
2016-07-18  7:22                               ` Michal Hocko
2016-07-14 14:08                 ` Ondrej Kozina
2016-07-14 15:31                   ` Michal Hocko
2016-07-14 17:07                     ` Ondrej Kozina
2016-07-14 17:36                       ` Michal Hocko
2016-07-14 17:39                         ` Michal Hocko
2016-07-15 11:42                       ` Tetsuo Handa
2016-07-13 13:19           ` Tetsuo Handa
2016-07-13 13:39             ` Michal Hocko
2016-07-13 14:18               ` Mikulas Patocka
2016-07-13 14:56                 ` Michal Hocko
2016-07-13 15:11                   ` Mikulas Patocka
2016-07-13 23:53                     ` David Rientjes
2016-07-14 11:01                       ` Tetsuo Handa
2016-07-14 12:29                         ` Mikulas Patocka
2016-07-14 20:26                         ` David Rientjes
2016-07-14 21:40                           ` Tetsuo Handa
2016-07-14 22:04                             ` David Rientjes
2016-07-15 11:25                           ` Mikulas Patocka
2016-07-15 21:21                             ` David Rientjes
2016-07-14 12:27                       ` Mikulas Patocka
2016-07-14 20:22                         ` David Rientjes
2016-07-15 11:21                           ` Mikulas Patocka
2016-07-15 21:25                             ` David Rientjes
2016-07-15 21:39                               ` Mikulas Patocka
2016-07-15 21:58                                 ` David Rientjes
2016-07-15 23:53                                   ` Mikulas Patocka
2016-07-18 15:14                             ` Johannes Weiner
2016-07-14 15:29                       ` Michal Hocko
2016-07-14 20:38                         ` David Rientjes
2016-07-15  7:22                           ` Michal Hocko
2016-07-15  8:23                             ` Michal Hocko
2016-07-15 12:00                             ` Mikulas Patocka
2016-07-15 21:47                             ` David Rientjes
2016-07-18  7:39                               ` Michal Hocko
2016-07-18 21:03                                 ` David Rientjes
2016-07-14  0:01             ` David Rientjes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1e31eea2-beb4-5734-c831-0c1753f0115a@redhat.com \
    --to=jmarchan@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=okozina@redhat.com \
    --cc=skozina@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox