we have a katello 3.0.1 install with a katello capsule in our DMZ.
we've noticed if we publish/promote content, the main katello server does
this reasonably quickly, but tasks (to sync content) on the capsule are
taking hours to complete to match the change on the main server. Does
anyone know why this is? Can it be improved?
Are you able to determine the network traffic between the server and
capsule? In general I think there are two possibilities, slow transfer of
data to the Capsule or slowness on the Capsule itself. I would recommend
checking the network speeds and the Capsule resources to see if it's
constrained I/O-wise or if processes are eating the CPU.
Eric
···
On Tue, Jul 26, 2016 at 10:52 PM, Dylan Baars wrote:
Hi all,
we have a katello 3.0.1 install with a katello capsule in our DMZ.
we’ve noticed if we publish/promote content, the main katello server does
this reasonably quickly, but tasks (to sync content) on the capsule are
taking hours to complete to match the change on the main server. Does
anyone know why this is? Can it be improved?
I see celery and mongod on the capsule using a lot of CPU time (celery
seems to jump between 50-100%, and has been going for 435:32.54) and mongod
~30-40%, 286:18.74. Here's current memory usage on the box
free -h
              total        used        free      shared  buff/cache   available
Mem:           5.7G        3.5G        166M         37M        2.0G        1.9G
Swap:          7.8G        2.4G        5.4G
and current IO
[root@wellcapsuleext foreman-proxy]# sar
Linux 3.10.0-327.el7.x86_64 (wellcapsuleext.niwa.co.nz)  07/28/2016  x86_64  (6 CPU)

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      0.76      0.03      0.41      0.02      0.00     98.78
12:20:01 AM     all      1.11      0.03      0.51      0.01      0.00     98.35
12:30:01 AM     all      0.94      0.03      0.46      0.01      0.00     98.57
12:40:01 AM     all      0.77      0.03      0.41      0.01      0.00     98.78
12:50:01 AM     all      1.09      0.03      0.50      0.02      0.00     98.37
01:00:01 AM     all      0.95      0.03      0.47      0.01      0.00     98.54
01:10:01 AM     all      0.74      0.03      0.39      0.01      0.00     98.83
01:20:01 AM     all      1.07      0.02      0.48      0.02      0.00     98.40
01:30:01 AM     all      0.92      0.03      0.45      0.06      0.00     98.54
01:40:01 AM     all      0.77      0.03      0.41      0.01      0.00     98.79
01:50:01 AM     all      1.07      0.03      0.49      0.01      0.00     98.41
02:00:01 AM     all      1.00      0.03      0.49      0.08      0.00     98.41
02:10:01 AM     all      0.76      0.03      0.40      0.01      0.00     98.81
02:20:01 AM     all      1.09      0.02      0.50      0.01      0.00     98.38
02:30:01 AM     all      0.94      0.03      0.45      0.01      0.00     98.57
02:40:02 AM     all      0.79      0.03      0.42      0.01      0.00     98.76
02:50:01 AM     all      1.09      0.03      0.50      0.02      0.00     98.36
03:00:01 AM     all      0.93      0.03      0.46      0.01      0.00     98.57
03:10:01 AM     all      0.79      0.03      0.40      0.01      0.00     98.77
03:20:01 AM     all      1.09      0.03      0.50      0.01      0.00     98.37
03:30:01 AM     all      0.96      0.02      0.50      0.01      0.00     98.51
03:40:01 AM     all      1.15      0.06      1.96      1.39      0.00     95.44
03:50:01 AM     all      1.08      0.00      0.43      0.18      0.00     98.31
04:00:01 AM     all      0.99      0.00      0.40      0.18      0.00     98.43
04:10:01 AM     all      0.75      0.03      0.41      0.02      0.00     98.80
04:20:01 AM     all      1.06      0.03      0.50      0.03      0.00     98.39
04:30:01 AM     all      0.93      0.03      0.47      0.01      0.00     98.56
04:40:01 AM     all      0.74      0.03      0.39      0.02      0.00     98.83
04:50:01 AM     all      1.09      0.03      0.51      0.05      0.00     98.33
05:00:01 AM     all      0.91      0.03      0.45      0.02      0.00     98.60
05:10:01 AM     all      0.76      0.03      0.41      0.02      0.00     98.79
05:20:01 AM     all      1.06      0.03      0.49      0.02      0.00     98.41
05:30:01 AM     all      0.93      0.03      0.47      0.02      0.00     98.55
05:40:01 AM     all      0.74      0.02      0.40      0.02      0.00     98.82
05:50:01 AM     all      1.08      0.03      0.52      0.04      0.00     98.34
06:00:01 AM     all      0.96      0.02      0.47      0.03      0.00     98.52
06:10:01 AM     all      0.74      0.02      0.41      0.01      0.00     98.81
06:20:01 AM     all      1.06      0.03      0.49      0.01      0.00     98.41

06:20:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:30:01 AM     all      0.93      0.02      0.46      0.01      0.00     98.58
06:40:01 AM     all      0.75      0.03      0.40      0.01      0.00     98.81
06:50:02 AM     all      1.07      0.02      0.50      0.02      0.00     98.38
07:00:01 AM     all      0.96      0.02      0.46      0.01      0.00     98.54
07:10:02 AM     all      0.77      0.02      0.42      0.01      0.00     98.78
07:20:01 AM     all      1.06      0.03      0.48      0.01      0.00     98.42
07:30:01 AM     all      0.94      0.03      0.47      0.01      0.00     98.56
07:40:01 AM     all      0.78      0.03      0.41      0.01      0.00     98.77
07:50:01 AM     all      1.06      0.03      0.50      0.01      0.00     98.41
08:00:02 AM     all      0.96      0.03      0.47      0.01      0.00     98.53
08:10:01 AM     all      0.75      0.02      0.40      0.01      0.00     98.82
08:20:01 AM     all      1.08      0.03      0.49      0.01      0.00     98.40
08:30:02 AM     all     10.48      0.02      2.18      1.58      0.00     85.73
08:40:02 AM     all      9.51      0.02      2.08      1.99      0.00     86.40
08:50:01 AM     all     10.72      0.02      0.97      0.73      0.00     87.56
09:00:01 AM     all     10.93      0.02      1.71      1.16      0.00     86.18
09:10:01 AM     all     14.54      0.02      1.99      1.94      0.00     81.52
09:20:01 AM     all     14.60      0.02      2.07      1.82      0.00     81.50
Average:        all      2.10      0.02      0.63      0.21      0.00     97.04
There have now been various sync jobs running for nearly 24 hours; the last
one is below:
Id: cbda7f78-dd70-48f6-ae09-67f29b0a5746
Label: Actions::Katello::Repository::CapsuleGenerateAndSync
Name: Sync Repository on Capsule(s)
Owner: foreman_admin
Execution type: Delayed
Start at: 2016-07-27 20:39:57 UTC
Start before: -
Started at: 2016-07-27 20:39:57 UTC
Ended at:
State: running
Result: -
Params: {"services_checked"=>["pulp", "pulp_auth"], "locale"=>"en"}
I've upped the CPU count on the capsule and added some extra RAM, will see
how it goes…
Dylan
···
On Thursday, July 28, 2016 at 3:24:52 AM UTC+12, Eric Helms wrote:
>
> Dylan,
>
> Are you able to determine the network traffic between the server and
> capsule? In general I think there are two possibilities, slow transfer of
> data to the Capsule or slowness on the Capsule itself. I would recommend
> checking the network speeds and the Capsule resources to see if its
> constrained I/O wise or processes are eating the CPU.
>
>
> Eric
>
> On Tue, Jul 26, 2016 at 10:52 PM, Dylan Baars wrote:
>
>> Hi all,
>>
>> we have a katello 3.0.1 install with a katello capsule in our DMZ.
>>
>> we've noticed if we publish/promote content, the main katello server does
>> this reasonably quickly, but tasks (to sync content) on the capsule are
>> taking hours to complete to match the change on the main server. Does
>> anyone know why this is? Can it be improved?
>>
>> Thanks!
>> Dylan
>>
> --
> Eric D. Helms
> Red Hat Engineering
> Ph.D. Student - North Carolina State University
>
I did that, didn't seem to help. I've now done a few things:
Removed all lifecycle environments from the capsule (via the katello GUI)
Used foreman-console to delete the 2 running tasks that had been stuck
at 62% for days
ForemanTasks::Task.find("7808217a-4057-4b43-a741-0c20a6c41307").destroy
ForemanTasks::Task.find("ff0e613c-9160-49d6-bbc0-b8fcbb0f583a").destroy
This then put the capsule in a "synchronised" state - i.e. no running tasks
according to katello
Added a single lifecycle environment to the capsule
Hit Synchronise
It ran up to 48% super quickly, and now seems stuck.
If I browse to the running task > running steps, there are a whole bunch of
waiting actions "Actions::Pulp::Repository::Destroy" - their state is "waiting
for Pulp to start the task".
Is this a communication problem? I have the following ports open on the
katello server
443/tcp 80/tcp 8140/tcp 9090/tcp 5647/tcp 5646/tcp 5672/tcp 8080/tcp 67/udp
9000/tcp 68/udp 53/tcp 69/udp 4505-4506/tcp 53/udp 5671/tcp
and on the capsule:
443/tcp 80/tcp 8140/tcp 9090/tcp 5647/tcp 9000/tcp 8443/tcp 8000/tcp 69/udp
4505-4506/tcp
There is a physical firewall between the two servers, these ports are open
on that for the two servers to communicate
HTTP (80/tcp)
HTTPS (443/tcp)
9090/tcp
4505-4506/tcp
5646/tcp
5647/tcp
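To double-check that the firewall really passes these between the two hosts, a bash-only probe like the following can be run from either side. This is a sketch: the hostname is a placeholder for the real Katello server FQDN, and it uses bash's built-in /dev/tcp so no nmap or nc is needed on the Capsule:

```shell
# check_port HOST PORT -> prints "PORT/tcp OPEN" or "PORT/tcp CLOSED/filtered".
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$port/tcp OPEN"
  else
    echo "$port/tcp CLOSED/filtered"
  fi
}

# HOST is a placeholder -- substitute the real Katello server FQDN.
HOST=${HOST:-katello.example.local}
for p in 80 443 5646 5647 9090; do
  check_port "$HOST" "$p"
done
```

Anything reported CLOSED/filtered here while being open in the firewall rules would point at an intermediate device or a service not listening.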
···
On Thursday, August 4, 2016 at 7:46:53 AM UTC+12, John Mitsch wrote:
>
> Dylan,
>
> I wonder if things are a little stopped up in pulp/celery on the capsule.
> Run this on the capsule:
>
> systemctl stop pulp_celerybeat.service pulp_workers.service
> pulp_resource_manager.service
> systemctl start pulp_celerybeat.service pulp_workers.service
> pulp_resource_manager.service
> systemctl restart httpd
>
> Sometimes this can help clear stuck pulp jobs. Not sure if that is your
> issue, but worth a try :)
>
> Thanks,
>
> John Mitsch
> Red Hat Engineering
> (860)-967-7285
> irc: jomitsch
>
> On Wed, Aug 3, 2016 at 3:41 PM, Dylan Baars wrote:
>
>> Hi again,
>>
>> I've noticed in the dynflow console that katello seems to be waiting for
>> pulp to start the task -
>>
>> 12: Actions::Pulp::Consumer::SyncCapsule (waiting for Pulp to start the
>> task) [ 201571.83s / 2023.95s ] Cancel
>>
>>
>> *Started at:* 2016-08-01 00:06:17 UTC
>>
>> *Ended at:* 2016-08-03 08:05:48 UTC
>>
>> *Real time:* 201571.83s
>>
>> *Execution time (excluding suspended state):* 2023.95s
>>
>> *Input:*
>>
>> ---
>> capsule_id: 2
>> repo_pulp_id: NIWA-Oracle_Virtualbox-Virtualbox_rhel7_x86_64
>> remote_user: admin
>> remote_cp_user: admin
>> locale: en
>>
>> *Output:*
>>
>> ---
>> pulp_tasks:- exception:
>> task_type: pulp.server.managers.repo.sync.sync
>> _href: "/pulp/api/v2/tasks/20254460-c490-4669-bf02-bbcf0161172d/"
>> task_id: 20254460-c490-4669-bf02-bbcf0161172d
>> tags:
>> - pulp:repository:NIWA-Oracle_Virtualbox-Virtualbox_rhel7_x86_64
>> - pulp:action:sync
>> finish_time:
>> _ns: task_status
>> start_time:
>> traceback:
>> spawned_tasks: []
>> progress_report: {}
>> queue: None.dq
>> state: waiting
>> worker_name:
>> result:
>> error:
>> _id:
>> "$oid": 579e9279b9867873daf31b4f
>> id: 579e9279b9867873daf31b4f
>> poll_attempts:
>> total: 12369
>> failed: 0
>>
>> Does this help anyone try and figure out whats going on?
>>
>> Thanks :-)
>> Dylan
>>
Checking over the course of a few minutes showed no change, so I ran the
workaround…
[root@wellcapsuleext ~]# for i in pulp_resource_manager pulp_workers
pulp_celerybeat; do service $i restart; done
Redirecting to /bin/systemctl restart pulp_resource_manager.service
Redirecting to /bin/systemctl restart pulp_workers.service
Redirecting to /bin/systemctl restart pulp_celerybeat.service
[root@wellcapsuleext ~]#
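To confirm the workers actually came back after the restart, it can help to look for the celery processes themselves (a rough sketch; on a healthy capsule you would expect to see a resource manager, a celerybeat, and one or more reserved_resource_worker processes):

```shell
# List running celery processes with pid, uptime, and full command line.
# The [c] bracket trick stops grep from matching its own process entry.
ps -eo pid,etime,args | grep -E '[c]elery' || echo "no celery workers running"
```

If no workers show up at all, that would go a long way toward explaining tasks sitting forever in the "waiting" state.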
The de-queues changed slightly, and the acquires/releases went up, but only
to a point - they are now stuck there…
Well, after a few hours it did actually finish the sync - which involved
removing the lifecycle environments I had deleted and syncing the single
lifecycle environment I had added.
It does seem to sit for a very long time just waiting for pulp - any idea
why?
Further, the capsule syncs seem to take swap space (even if RAM is/was
available) and never give it back - to the point where we were publishing
content views and tomcat failed; /var/log/foreman/production.log starts
getting…
2016-08-04 14:20:15 [app] [I] Imported report for niwa-35649.niwa.local in
0.12 seconds
2016-08-04 14:20:15 [app] [W] Action failed
> Katello::Errors::ConnectionRefusedException: A backend service [ Candlepin ] is unreachable
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/http_resource.rb:90:in `rescue in get'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/http_resource.rb:80:in `get'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/resources/candlepin.rb:100:in `get'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/services/katello/candlepin/consumer.rb:17:in `block in <class:Consumer>'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/lazy_accessor.rb:160:in `instance_eval'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/lazy_accessor.rb:160:in `run_initializer'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/lazy_accessor.rb:139:in `lazy_attribute_get'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/lazy_accessor.rb:60:in `block (2 levels) in lazy_accessor'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/services/katello/candlepin/consumer.rb:38:in `entitlement_status'
> /opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/models/katello/subscription_status.rb:40:in `to_status'
> /usr/share/foreman/app/models/host_status/status.rb:64:in `update_status'
> /usr/share/foreman/app/models/host_status/status.rb:48:in `refresh'
> /usr/share/foreman/app/models/host_status/status.rb:42:in `refresh!'
> /usr/share/foreman/app/models/host/managed.rb:858:in `block in refresh_statuses'
> /opt/rh/rh-ruby22/root/usr/share/ruby/set.rb:283:in `each_key'
> /opt/rh/rh-ruby22/root/usr/share/ruby/set.rb:283:in `each'
and in /var/log/messages
Aug 4 14:20:07 wellkatello.niwa.local qdrouterd: Thu Aug 4 14:20:07 2016
ROUTER_LS (info) Router Link Lost - link_id=0
Aug 4 14:20:12 wellkatello.niwa.local kernel: diagnostic_con* invoked
oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Aug 4 14:20:12 wellkatello.niwa.local kernel: diagnostic_con* cpuset=/
mems_allowed=0-1
Aug 4 14:20:12 wellkatello.niwa.local kernel: CPU: 1 PID: 7900 Comm:
diagnostic_con* Not tainted 3.10.0-327.22.2.el7.x86_64 #1
Aug 4 14:20:12 wellkatello.niwa.local kernel: Hardware name: VMware, Inc.
VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00
04/14/2014
Aug 4 14:20:12 wellkatello.niwa.local kernel: ffff880415850b80
00000000555a1641 ffff880004e3fb68 ffffffff816360fc
Aug 4 14:20:12 wellkatello.niwa.local kernel: ffff880004e3fbf8
ffffffff8163109c ffff88041ac8fc20 ffff88041ac8fc38
Aug 4 14:20:12 wellkatello.niwa.local kernel: 018804fb00000206
fffeffff00000000 0000000000000001 ffffffff81128c03
Aug 4 14:20:12 wellkatello.niwa.local kernel: Call Trace:
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff816360fc>]
dump_stack+0x19/0x1b
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff8163109c>]
dump_header+0x8e/0x214
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81128c03>] ?
delayacct_end+0x13/0xb0
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff8116d20e>]
oom_kill_process+0x24e/0x3b0
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81088dae>] ?
has_capability_noaudit+0x1e/0x30
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff8116da36>]
out_of_memory+0x4b6/0x4f0
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81173c26>]
__alloc_pages_nodemask+0xaa6/0xba0
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81078d73>]
copy_process.part.25+0x163/0x1610
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff8107a401>]
do_fork+0xe1/0x320
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff8107a6c6>]
SyS_clone+0x16/0x20
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81646bd9>]
stub_clone+0x69/0x90
Aug 4 14:20:12 wellkatello.niwa.local kernel: [<ffffffff81646889>] ?
system_call_fastpath+0x16/0x1b
Aug 4 14:20:12 wellkatello.niwa.local kernel: Mem-Info:
Aug 4 14:20:12 wellkatello.niwa.local kernel: Node 0 DMA per-cpu:
…
…
Aug 4 14:20:12 wellkatello.niwa.local kernel: Out of memory: Kill process
23388 (java) score 50 or sacrifice child
Aug 4 14:20:12 wellkatello.niwa.local kernel: Killed process 23388 (java)
total-vm:9879424kB, anon-rss:856892kB, file-rss:0kB
Aug 4 14:20:13 wellkatello.niwa.local systemd: tomcat.service: main
process exited, code=killed, status=9/KILL
Aug 4 14:20:13 wellkatello.niwa.local puppet-master[32492]: Compiled
catalog for niwa-35517.niwa.local in environment production in 0.03 seconds
Aug 4 14:20:15 wellkatello.niwa.local server: Java virtual machine used:
/usr/lib/jvm/jre/bin/java
Aug 4 14:20:15 wellkatello.niwa.local server: classpath used:
/usr/share/tomcat/bin/bootstrap.jar:/usr/share/tomcat/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
Aug 4 14:20:15 wellkatello.niwa.local server: main class used:
org.apache.catalina.startup.Bootstrap
Aug 4 14:20:15 wellkatello.niwa.local server: flags used:
Aug 4 14:20:15 wellkatello.niwa.local server: options used:
-Dcatalina.base=/usr/share/tomcat -Dcatalina.home=/usr/share/tomcat
-Djava.endorsed.dirs= -Djava.io.tmpdir=/var/cache/tomcat/temp
-Djava.util.logging.config.file=/usr/share/to$
Aug 4 14:20:15 wellkatello.niwa.local server: arguments used: stop
Aug 4 14:20:24 wellkatello.niwa.local pulp: celery.worker.strategy:INFO:
Received task:
pulp.server.async.tasks._queue_reserved_task[8a0acee7-a3ae-4eba-9b45-f5edb974d71e]
Aug 4 14:20:24 wellkatello.niwa.local pulp: celery.worker.strategy:INFO:
Received task:
pulp.server.managers.consumer.applicability.regenerate_applicability_for_consumers[a4f48860-5627-4931-abfa-9556309ce343]
Aug 4 14:20:24 wellkatello.niwa.local pulp: celery.worker.strategy:INFO:
Received task:
pulp.server.async.tasks._release_resource[26b8b00d-f769-4452-920f-cf350d5fafd7]
Aug 4 14:20:24 wellkatello.niwa.local server: Aug 04, 2016 2:20:24 PM
org.apache.catalina.startup.Catalina stopServer
Aug 4 14:20:24 wellkatello.niwa.local server: SEVERE: Could not contact
localhost:8005. Tomcat may not be running.
Aug 4 14:20:24 wellkatello.niwa.local pulp: celery.worker.job:INFO: Task
pulp.server.async.tasks._queue_reserved_task[8a0acee7-a3ae-4eba-9b45-f5edb974d71e]
succeeded in 0.318352089002s: None
Aug 4 14:20:24 wellkatello.niwa.local server: Aug 04, 2016 2:20:24 PM
org.apache.catalina.startup.Catalina stopServer
Aug 4 14:20:24 wellkatello.niwa.local server: SEVERE: Catalina.stop:
A reboot clears things up.
I'm not sure what to provide to try and troubleshoot this issue.
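Before rebooting, one way to see which processes are actually holding the swap is to read the VmSwap field out of /proc - a sketch only (Linux-specific, values in kB, and you'll want root to see every process):

```shell
# Print the top 10 swap consumers as "<kB> <pid> <name>", one per line,
# using the Name and VmSwap fields from /proc/<pid>/status.
for d in /proc/[0-9]*; do
  awk -v pid="${d#/proc/}" \
      '/^Name:/             {name=$2}
       /^VmSwap:/ && $2 > 0 {print $2, pid, name}' "$d/status" 2>/dev/null
done | sort -rn | head -n 10
```

If it's the java (tomcat) or celery processes holding most of the swap, that would match what the OOM killer log above shows.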
I would be curious to see a foreman-debug report of this. Is there any way
you could prepare one and send it to Eric and myself?
-John
John Mitsch
Red Hat Engineering
(860)-967-7285
irc: jomitsch
···
On Wed, Aug 3, 2016 at 11:09 PM, Dylan Baars wrote:
Further, the capsule syncs seem to take swap space (even if RAM is/was
available) and never gives it back - to the point where we were publishing
content views and tomcat failed, in /var/log/foreman/production.log start
getting…
2016-08-04 14:20:15 [app] [I] Imported report for niwa-35649.niwa.local in
0.12 seconds
2016-08-04 14:20:15 [app] [W] Action failed
Katello::Errors::ConnectionRefusedException: A backend service [
Candlepin ] is unreachable
/opt/theforeman/tfm/root/usr/share/gems/gems/katello-3.0.2/app/lib/katello/http_resource.rb:90:in
`rescue in get'