from time to time our heavy Puppet users with many systems end up with huge tables:
This is because we try to store reports in normal form so it’s searchable. Essentially, report is imported line by line, hash is calculated from puppet resource and stored as source, the same for message. The lines are then stored into logs table and associated with report, message and source. Of course hash and IDs are all indexed which can lead to big index data in RDBM.
I have some optimizations in mind, but I want to discuss this with you, users and devs.
Step 1: Combine multiple records
On our production infra instance I see a lot of reports having multiline message. This is currently stored as separate source, message and log records but during import it is trivial to find out that resource did not change and we could concatenate the lines into one. The proposal is to actually perform such detection if multi-line entries. The only change would be that in report, the message column could be now multi-line (it is currently one-line only).
This is probably the easiest change, the question is how much this will help to bring the amount of records down. On our instance, we have a lot of these. Are stacktrace outputs (which are usually multi-line) common? We also present diffs in a nice way as we saw on the demo, I’d need to figure out how we process diffs. This change must not break this behavior.
|notice||/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns||Traceback (most recent call last):|
|notice||/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns||File “/bin/jenkins-jobs”, line 11, in|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/entry.py”, line 146, in main|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/entry.py”, line 140, in execute|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/subcommand/update.py”, line 128, in execute|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/subcommand/update.py”, line 89, in _generate_xmljobs|
|notice||builder = JenkinsManager(jjb_config)|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/builder.py”, line 59, in init|
|notice||File “/usr/lib/python2.7/site-packages/jenkins_jobs/cache.py”, line 54, in init|
|notice||“Unable to lock cache for ‘%s’” % jenkins_url)|
|notice||jenkins_jobs.errors.JenkinsJobsException: Unable to lock cache for ‘https://ci.theforeman.org’|
|err||Puppet||‘jenkins-jobs --conf /etc/jenkins_jobs/jenkins_jobs_theforeman.org.ini update /etc/jenkins_jobs/theforeman.org > /var/cache/jjb.xml’ returned 1 instead of one of |
|err||change from ‘notrun’ to [‘0’] failed: ‘jenkins-jobs --conf /etc/jenkins_jobs/jenkins_jobs_theforeman.org.ini update /etc/jenkins_jobs/theforeman.org > /var/cache/jjb.xml’ returned 1 instead of one of |
|notice||Puppet||Applied catalog in 94.32 seconds|
Step 2: User defined storage level
Report lines can have multiple levels: debug info notice warning err alert emerg crit. Today, it is not possible to opt-out. The idea is to create a setting that would work like in logging, anything that matches the level and up would be logged, anything else dropped.
This way users could release the stress on a particular instance temporarily or permanently by ignoring let’s say debug and info messages. I do realize that notice and above for Puppet are important as they provide important details, by default Foreman would collect all the levels. This is purely opt-in.
Step 3: Store source/message hash more efficiently
We currently store calculate SHA1 from source and message and store it in hexadecimal form which is 40 bytes. Then we let the RDBM to index those two columns. Those tables can be huge and every byte saved counts.
I propose to change the type of the column from string to UUID and use UUIDv5 algorithm which is based on SHA1 and available in Rails (https://api.rubyonrails.org/classes/Digest/UUID.html) to calculate the very same. Postgres and MySQL 8.0+ supports UUID. If the data type turns out to be problem, we can use simply binary(20) to store 160bits of SHA1 directly, both RDBMs should support that. In the worst case, this is 50% less data to store and index.
I’ve already created an attempt previously, this is just more simple approach:
The biggest concern is that existing data would need to be recalculated and reindexed. This will take several hours and it can cause pain during upgrade. Since the goal is really to get rid of string (varchar) this must be revolution rather than evolution to get some effect.
If there is a volunteer to take a look on how much of index savings we can get by such a migration, that would be awesome as my instances are pretty much empty. This must be tested on dataset with millions of records to see a difference.