Dev server hangs on code change

lzap · January 30, 2018, 12:44pm

Does it really hang for real, or do you see logs no longer available?

We are tracking a bug (after Rails 5 upgrade) that logger silencer, a terrible feature in Rails logging stack, is no longer compatible with our Logging gem stack. After some time, logging resets to ERROR level, which is not useful in dev mode of course. I am working on a patch currently:

github.com/theforeman/foreman

Fixes #22354 - Make SilencedLogger thread safe

theforeman:develop ← lzap:concurrent-silenced-logger-22354

opened 11:11AM - 22 Jan 18 UTC

lzap

+78 -12

Rails logger silencer code was thread safe for some time: https://github.com/ra…ils/rails/commit/629efb605728b31ad9644f6f0acaf3760b641a29 But this has been refactored in Rails 5.0 and thread safety is now part of logger instead of silencer: https://github.com/rails/rails/commit/2518bda97cbbcb33dc9a92e70d5b01c09e64d12d The patch made assumption that all Rails applications are using Rails logging stack, but that's not the case for Foreman - we rely on much more flexible "logging" gem (https://github.com/TwP/logging) which does not implement this type of thread safety for levels. Foreman has its own logging delegator called SilencedLogger and the code there was never thread safe, concurrent sprockets requests are now using silence method which is not thread safe at all, with logging gem this leads to level set to ERROR (the default) after few requests in development mode which renders Foreman unusable (no debug logging output). This patch uses the same technique and stores temporary levels into thread-only context, so on concurrent access there are no incorrect readouts.

thomasmckay · January 30, 2018, 1:01pm

Yes, I have seen the logging stop; thanks for working on that! The hang is different.

Pausing debugger more often today, I see it quite often is in CandlepinListeningService poll_for_messages

github.com

Katello/katello/blob/master/app/lib/actions/candlepin/candlepin_listening_service.rb#L90


rescue ::Actions::Candlepin::ConnectionError => e
  {:result => nil, :error => e.message}
end


def poll_for_messages(suspended_action)
  @thread.kill if @thread
  @thread = Thread.new do
    count = 0
    loop do
      begin
        message = fetch_message
        if message[:result]
          result = message[:result]
          suspended_action.notify_message_received(result.message_id, result.subject, result.content)
        elsif message[:error]
          suspended_action.notify_not_connected(message[:error])
          break
        end


        if count >= LOOP_RECEIVE_COUNT || (message[:result].nil? && message[:error].nil?)
          sleep 1

lzap · January 30, 2018, 1:12pm

I think dynflow was integrated into rails process now, I guess that’s it. @inecas ?

iNecas · February 1, 2018, 3:57pm

The dynflow has been integrated into the Rails process from the day one in development. Nothing changed there recently.

There has been some issues observed after migration to Rails 5, that were connected to the changes that were introduced by Rails making the code reloading thread safe (see more details here Bug #21173: class loading leads to deadlocks on rails 5 - foreman-tasks - Foreman).

The fact, that the trace point to the code you’re sending might be just due to the fact that this is running in separate thread, and it’s always there: it doesn’t have to mean it’s actually related.

I’ve seen some hangs recently in jenkins, that were related to the RSS notifications checker Refactor #22272: Don’t run CreateRssNotifications in tests - Foreman. Maybe trying to comment the code out that initiates it (see the issue for where the code resides), could point you to the right direction.

ehelms · February 2, 2018, 6:35pm

I tried to do some digging and diff testing with this. Commenting out various things, trying to use Dynflow as a remote executor (which I could not get to work in a dev environment) what I was able to single it down to is related to what Tom said. The always running dynflow tasks seem to be related to the issue.

If I comment out the following code in lib/katello/engine.rb:

    initializer "katello.initialize_cp_listener", :before => :finisher_hook do
      unless ForemanTasks.dynflow.config.remote? || File.basename($PROGRAM_NAME) == 'rake' || Rails.env.test?
        ForemanTasks.dynflow.config.on_init do |world|
          ::Actions::Candlepin::ListenOnCandlepinEvents.ensure_running(world)
          ::Actions::Katello::EventQueue::Monitor.ensure_running(world)
        end
      end
    end

I do not get server hangs anymore when I edit code. With this chunk present, I can replicate the freeze almost 100% of the time. Is it possible something changed in either dynflow, foreman-tasks or the way Foreman is initializing Dynflow that leads to this? Is the Katello code potentially out of date?

ehelms · February 2, 2018, 9:13pm

I am yet again seeing the production nightly failure as well. And it appears related to this same area? I am sorta just taking shots at what might be the issues, but given it is in the same area maybe these are all related. After a few service restarts I see the following on my centos7-katello-bats-ci:

ehelms · February 3, 2018, 2:03am

A few more things I noticed that may or may not be factors.

In dev environment:

The standard dev environment comes with a development database, production database and test database. The development and production definitions have the same database name. If I rename the production database to say katello_prod and do a rake katello:reset I do not encounter this issue. If I keep them the same name, do a rake katello:reset I do hit the issue.

In production environment:

The default production install configures a test and development sqlite database in database.yml (the databases do not actually seem to exist). If I remove these from the file, and do a fresh install with only the production postgresql database defined all bats test run clean and I have no weirdness with the always running tasks.

lzap · February 5, 2018, 12:09pm

Honestly, I don’t understand why we are running anything in a thread, both Webrick and Puma are multi-threaded by default now and this can cause lot of pain as our code is not safe. I’d rather see dynflow being started as an external process until we are sure our Rails app is thread-safe.

thomasmckay · February 6, 2018, 12:58pm

I changed the name of prod db on a fresh dev install and ran katello:reset. I can edit controllers w/o issue but editing an “action” class hangs still.

John_Mitsch · March 19, 2018, 6:38pm

Since we are still battling this, I looked into this a bit. Unfortunately, I wasn’t able to pinpoint the problem, but I’ll share my digging in case it helps anyone else.

@thomasmckay pointed me to a similar issue in graphql-ruby that seemed hopeful. That did result in a fix in rails that isn’t in 5.1.4, so I tried applying the patch without luck. It looks like we don’t even use the ActiveSupport::Reloader middleware mentioned in the issue. (please correct me if this isn’t true)

However, we do use ActionDispatch::Reloader, so I tried the same technique mentioned in the issue of figuring out if that is the middleware causing the issue

diff --git a/config/application.rb b/config/application.rb
index b4e31ae..94cca03 100644
--- a/config/application.rb
+++ b/config/application.rb
@@ -1,6 +1,22 @@
 if defined?(Rake.application) && Rake.application.top_level_tasks.grep(/jenkins/).any?
   ENV['RAILS_ENV'] ||= 'test'
 end
+
+class WhatTheHeckMiddleware 
+  def initialize(app, opts = {}) 
+    @app = app 
+    @idx = opts[:idx]
+  end 
+
+  # Just log when this middleware starts and ends
+  def call(env) 
+    puts "~~ wth START #{@idx} ~~"
+    res = @app.call(env)
+    puts "~~ wth END #{@idx} ~~"
+    res 
+  end 
+end 
+
 require File.expand_path('../boot', __FILE__)
 require 'apipie/middleware/checksum_in_headers'
 require 'rails/all'
@@ -147,6 +163,8 @@ module Foreman

     # enables JSONP support in the Rack middleware
     config.middleware.use Rack::JSONP if SETTINGS[:support_jsonp]
+    config.middleware.insert_before(ActionDispatch::Reloader, WhatTheHeckMiddleware, { idx: 1 })
+    config.middleware.insert_after(ActionDispatch::Reloader, WhatTheHeckMiddleware, { idx: 2 })

I reproduced the issue, and the server did hang without printing the END statements, so I’m pretty sure something is going on in ActionDispatch::Reloader. I have no idea if its on the Rails side or ours. I debugged through the rails code and quickly found myself lost, scared, and more confused. If anyone wants to take that up, here are the files I found of interest:

~/.rvm/gems/ruby-2.4.1/gems/actionpack-5.1.4/lib/action_dispatch/middleware/executor.rb
~/.rvm/gems/ruby-2.4.1/gems/activesupport-5.1.4/lib/active_support/execution_wrapper.rb
~/.rvm/gems/ruby-2.4.1/gems/activesupport-5.1.4/lib/active_support/callbacks.rb

I have no idea if this is helpful or even on the right track, but maybe it will bring us closer to a solution, since this is quite an annoying issue and doesn’t seem to be disappearing. I’m open to ideas to debug further down this path or others.

Regarding the discussion around dynflow, is there a way we can run katello without dynflow to confirm or deny that this could be causing the server to hang?

We could also try upgrading to one of the rails 5.2 RCs and see if that changes anything

John_Mitsch · March 23, 2018, 7:17pm

I looked into this more and was able to get some more info and come up with a (less-than-ideal) workaround. We still need to come up with an actual solution to this. I’m going to post what I’ve found so far in detail, in hopes that this will help lead us a permanent solution.

I’ll go into what I found, but first, here is the workaround for those interested. It is modifying ActiveRecord itself. DISCLAIMER: I have no idea what sort of implications this change has, use at your own risk

Workaround: comment out this line in active record. For me that meant going to /home/vagrant/.rvm/gems/ruby-2.4.1/gems/activerecord-5.1.4/lib/active_record/railtie.rb and commenting out ActiveRecord::Base.clear_reloadable_connections! around line 153.

What I’ve found so far: I was able to get some excellent help from a user named dminuoso on #RubyOnRails (freenode). He suggested using debug locks to get some more information. This meant adding config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks to config/environments/development.rb, restarting the server, reproducing the issue, and then pointing my browser to <server>/rails/locks.

This showed a thread locking two other threads (in some cases it was 3 others) with a traceback. This traceback showed the clear_reloadable_connections! method above, which led me to the workaround.

As far as what is creating that lock, it was suggested any code using lock! or AR::Base#connection_pool could be causing this. I see two instances in Dynflow one two that are using connection_pool, but I couldn’t confirm they are causing this lock.

Doing some more debugging, I found this SQL statement executing right before the lock

SELECT  "katello_events"."object_id" FROM   
"katello_events" WHERE "katello_events"."in_progress" = $1 ORDER BY "katello_events"."created_at"
ASC LIMIT $2

which would indicate the katello event queue is to blame, and goes along with @ehelms’s theory in a previous comment.

Some devs have had luck commenting out the event queue as well. Its not foolproof in my experience, probably because it can restart on its own.

There is an issue in rails that possibly could be related, where they suggest to use ActiveSupport::Dependencies.interlock.permit_concurrent_loads to fix deadlocks. I was suggested this on #RubyOnRails as well.

Hopefully this helps us pinpoint the exact issue further, let me know if you have any questions about what I’ve posted. Lets beat this pesky issue

John_Mitsch · March 23, 2018, 9:28pm

I found out some more about this with a potential fix:

It looks like there is a very similar issue in rails to what we are seeing with the threads. This may be some help to us.

An easier (and much better than the previous) workaround is to change the Katello event queue polling to a higher number. I used 100 and wasn’t able to reproduce the server hang.

I submited a PR in wrapping the event queue block in Rails.application.executor.wrap, which seems to fix the issue (someone please confirm) and should be present in as each thread should be wrapped before it runs application code according to rails docs.

Hopefully this is a permanent fix, I will continue to investigate this and potential underlying issues so we aren’t just putting a band-aid over things.

lzap · March 26, 2018, 10:57am

Thanks for the investigation.

To this date, I do not understand how we ended up running dynflow in another thread in our thread unsafe Rails application. What blocked us from running it separately from foreman init gem? We already have node, 2 or 3 is no difference.

ohadlevy · March 26, 2018, 11:07am

I agree with you - I did try at
https://github.com/theforeman/foreman/pull/4871

Marek_Hulan · March 27, 2018, 5:50am

Just to explain, dynflow was used this way before webpack was added. The idea was, that folks who do not know, what dynflow is, shoulndn’t need to care. Long years it worked fine and devs didn’t know they are running executor in their environment. Adding yet another separate service will increase new dev barrier, like webpack did. On the other hand, it will be closer to production like env.

I’m not against extracting it, but I think the in process executor was a good idea. The fact there’s a bug now, does not change it IMHO. By extracting we’ll lose the autoreloading, which seems to cause these problems. I guess for running test, we’ll continue using the in process executor?

TimoGoebel · March 27, 2018, 8:33am

Me, too. I think the benefits of this being closer to how this is done in production outweigh the odds.

lzap · March 27, 2018, 1:53pm

This discussion is not about if embedded dynflow for development setup is nice or not. My long-term opinion is to be as close as possible with development setup so I’d be against it anyway.

This discussion is all about spawning new threads in our thread non-safe Rails application (*). This is a no go, Dynflow must be removed from our Rails app ASAP, extraction into Procfile seems to be the best option now.

I am not familiar with Dynflow initialization, but I am willing to test the change once the PR is up.

(*) Although Rails 4.x was considered thread safe, this applies to clean application without any dependencies. We have many and it’s is actually not realistic to prove all our deps are safe. And it is currently also irrelevant - our current deployment is a forking server anyway, so one thread per instance is just enough.

Marek_Hulan · March 27, 2018, 6:15pm

Perhaps I misunderstood how the built-in executor worked, but I though it runs in the same thread. I don’t think it would spawn new thread in this mode. But I defer to devs who are more familiar with that part. I just wanted to slow down “let’s blame dynflow” direction, that I felt from the discussion. I’m sorry if I read that wrong. I’m all for setup, which is closer to how it works in production. But keep in mind test env.

iNecas · March 27, 2018, 7:35pm

There are two topics here:

running Dynflow or not in as part of the web server in development setup. Given the generic preferences that folks expressed, I’m not opposed against this move. The only reason was just a convenience.
using thread pool in Dynflow executor: Dynflow uses multiple threads inside process (in both devel and production setup), and spreads work across this thread pool. This is a design decision form the very beginning and there is no way we’re going to move away from this model. There have not been that many thread-safeness issues so far and we’ve been able to resolve all of them. I have different suggestion: please don’t count on thread unsafeness in general, and if you do, please stop ASAP, this is a no go :). This is not something that would be specific just to Dynflow. E.g. sidekiq and others use this model as well,

Please note that the whole issue in this thread talks about development setup, which might be something completely unrelated to production.

Could you be more specific about which dependencies are you talking about? As I already described, one thread per instance is unacceptable approach for me.

Btw. given I don’t see the server hangs on Foreman + Remote Execution and Katello uses additional threads outside of the Dynflow pool, I would recommend looking also there.

lzap · March 28, 2018, 7:40am

I think this is important topic, so I opened new thread, I will continue there.