RFC: Optimize storing reports

lzap · January 11, 2019, 3:12pm

Hello,

from time to time our heavy Puppet users with many systems end up with huge tables:

reports
messages
sources
logs

This is because we try to store reports in normal form so it’s searchable. Essentially, report is imported line by line, hash is calculated from puppet resource and stored as source, the same for message. The lines are then stored into logs table and associated with report, message and source. Of course hash and IDs are all indexed which can lead to big index data in RDBM.

I have some optimizations in mind, but I want to discuss this with you, users and devs.

Step 1: Combine multiple records

On our production infra instance I see a lot of reports having multiline message. This is currently stored as separate source, message and log records but during import it is trivial to find out that resource did not change and we could concatenate the lines into one. The proposal is to actually perform such detection if multi-line entries. The only change would be that in report, the message column could be now multi-line (it is currently one-line only).

This is probably the easiest change, the question is how much this will help to bring the amount of records down. On our instance, we have a lot of these. Are stacktrace outputs (which are usually multi-line) common? We also present diffs in a nice way as we saw on the demo, I’d need to figure out how we process diffs. This change must not break this behavior.

Example:

Level	Resource	Message
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	Traceback (most recent call last):
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/bin/jenkins-jobs”, line 11, in
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	sys.exit(main())
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/entry.py”, line 146, in main
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	jjb.execute()
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/entry.py”, line 140, in execute
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	ext.obj.execute(self.options, self.jjb_config)
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/subcommand/update.py”, line 128, in execute
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	options, jjb_config)
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/cli/subcommand/update.py”, line 89, in _generate_xmljobs
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	builder = JenkinsManager(jjb_config)
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/builder.py”, line 59, in init
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	flush=jjb_config.builder[‘flush_cache’])
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	File “/usr/lib/python2.7/site-packages/jenkins_jobs/cache.py”, line 54, in init
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	“Unable to lock cache for ‘%s’” % jenkins_url)
notice	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	jenkins_jobs.errors.JenkinsJobsException: Unable to lock cache for ‘https://ci.theforeman.org’
err	Puppet	‘jenkins-jobs --conf /etc/jenkins_jobs/jenkins_jobs_theforeman.org.ini update /etc/jenkins_jobs/theforeman.org > /var/cache/jjb.xml’ returned 1 instead of one of [0]
err	/Stage[main]/Jenkins_job_builder/Jenkins_job_builder::Config[theforeman.org]/Exec[jenkins-jobs-update-theforeman.org]/returns	change from ‘notrun’ to [‘0’] failed: ‘jenkins-jobs --conf /etc/jenkins_jobs/jenkins_jobs_theforeman.org.ini update /etc/jenkins_jobs/theforeman.org > /var/cache/jjb.xml’ returned 1 instead of one of [0]
notice	Puppet	Applied catalog in 94.32 seconds

Step 2: User defined storage level

Report lines can have multiple levels: debug info notice warning err alert emerg crit. Today, it is not possible to opt-out. The idea is to create a setting that would work like in logging, anything that matches the level and up would be logged, anything else dropped.

This way users could release the stress on a particular instance temporarily or permanently by ignoring let’s say debug and info messages. I do realize that notice and above for Puppet are important as they provide important details, by default Foreman would collect all the levels. This is purely opt-in.

Step 3: Store source/message hash more efficiently

We currently store calculate SHA1 from source and message and store it in hexadecimal form which is 40 bytes. Then we let the RDBM to index those two columns. Those tables can be huge and every byte saved counts.

I propose to change the type of the column from string to UUID and use UUIDv5 algorithm which is based on SHA1 and available in Rails (Digest::UUID) to calculate the very same. Postgres and MySQL 8.0+ supports UUID. If the data type turns out to be problem, we can use simply binary(20) to store 160bits of SHA1 directly, both RDBMs should support that. In the worst case, this is 50% less data to store and index.

I’ve already created an attempt previously, this is just more simple approach:

github.com/theforeman/foreman

Fixes #22876 - message/source digests are bigint

theforeman:develop ← lzap:xxhash-22876

opened 02:03PM - 13 Mar 18 UTC

lzap

+102 -4

This patch converts digest columns from SHA1 HEX (40 characters) to BIGINT (8 by…tes). Instead using secure SHA1 hash, it now uses very fast XXHash which aims for speed and delivers high quality entropy. Although the benchmark included in the patch shows small performance improvement on sqlite3, the main reason for this page is saved storage on PostgreSQL server - instead storing 40 characters we only need 8 bytes (64 bits). Searching BIGINT via indexes is faster than VARCHAR in every aspect, so there will be performance boost seen on deployments with millions of records in those tables. Please do not merge until this performance boost is confirmed. To test this patch, please try it on production instance (PostgreSQL/MySQL) with tens of millions of records in message and/or source tables. The benchmark test can be used to compare before/after.

The biggest concern is that existing data would need to be recalculated and reindexed. This will take several hours and it can cause pain during upgrade. Since the goal is really to get rid of string (varchar) this must be revolution rather than evolution to get some effect.

If there is a volunteer to take a look on how much of index savings we can get by such a migration, that would be awesome as my instances are pretty much empty. This must be tested on dataset with millions of records to see a difference.

Other ideas?

lzap · January 15, 2019, 11:20am

I would like to ask if you could share your average length of messages and sources from your database. Please do:

select avg(length(value)) from messages;
select avg(length(value)) from sources;

And also state what reports do you mostly have in your database (Puppet, Ansible, SCAP). Thanks.

Edit: I would be also interested in few examples of the longest message and/or source values.

lzap · January 15, 2019, 12:37pm

My thinking:

Postgres supports hash index type which does not sort records in any way but it’s much faster. Since we never iterate through messages and sources tables and only do finds, hash is an excellent choice. Index size is more than 50% smaller than regular b-tree and it’s also faster.

Postgres devs were discouraging use of hash indexes for a decade now, but things are changing and 9.0+ and 10+ versions are now actually better than it was.
The major drawback is if you have postgres 9 or older, replication would not work with hash indexes. Upgrading to postgres 10 solves the problem tho, so this should not be blocker.

MySQL pretty much supports it for some time and there are no limitations to my knowledge.

Here is how messages and sources tables look like today:

foreman=# select * from messages;
 id |                    value                    |                  digest                  
----+---------------------------------------------+------------------------------------------
  1 | Retrieving pluginfacts                      | 3335b7faa7aac6cb2889a91fa9feeeb03fe5bd20
  2 | Retrieving plugin                           | e461f73662588ea80f734eb53ab03104541ac4ea
  3 | Caching catalog for lloyd-albanese.nat.lan  | bdeaac3e5d4b675603ffe3c7c401da7ca5cec475
  4 | Applying configuration version '1541752242' | c2441240d2954804589ba6027bbe57d5d313d039
  5 | Applying configuration version '1541753055' | 18ff3675cee4e5a51e9cdad25812cc287103859e
  6 | Applying configuration version '1541754961' | e644f9e25c3a5ad23a00f3d74a3c59606b8e0c9f
  7 | Applying configuration version '1541757425' | 11f4b150f1e40edddd9144ee624e9b21e3f12585
  8 | Applying configuration version '1542038281' | 5627e0a89a2f1a94f2979aa85ca40ed82ac6c2db
  9 | Applying configuration version '1542039451' | 01ab1170b3e50582a571104c5dbdd23c320ddea8
 10 | Applying configuration version '1542040332' | 778b81022b15a0c286992326e97af825203ce1e8

The digest is SHA1 hex encoded digest of the value itself, it has a b-tree index.

Proposal 4: Drop digests and use index on value

Obviously, we only search via a digest when creating new records, then the records are associated via ID keys. The same job would do index on value column. It looks like for small value lengths (“Test message 1234567” = up to 20 characters), b-tree performs better than using explicit SHA1 digest:

I created million of records:

create extension pgcrypto;

do $$
begin
for r in 1..10000000 loop
insert into messages (value, digest) values('Test message ' || r, encode(digest('Test message ' || r, 'sha1'), 'hex'));
end loop;
end;
$$;

The created an b-tree index on values:

create index ex_messages_value on messages (value);

And then created two transactions:

$ cat pgbench-messages.sql
\setrandom ix 1 10000000
select * from messages where digest = encode(digest('Test message ' || :ix, 'sha1'), 'hex');

$ cat pgbench-messages-value.sql 
\setrandom ix 1 10000000
select * from messages where value = 'Test message ' || :ix;

Pgbench is giving similar results:

bash-4.2$ pgbench -f pgbench-messages-value.sql -n foreman -c4 -j2 -t5000
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
number of transactions per client: 5000
number of transactions actually processed: 20000/20000
tps = 14523.579394 (including connections establishing)
tps = 14570.913906 (excluding connections establishing)

bash-4.2$ pgbench -f pgbench-messages.sql -n foreman -c4 -j2 -t5000
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
number of transactions per client: 5000
number of transactions actually processed: 20000/20000
tps = 13588.045781 (including connections establishing)
tps = 13635.677204 (excluding connections establishing)

In this scenario, a simple index actually performs better than calculating SHA1, encoding it into HEX and using that via an index. It’s TPS is 13k vs 14k.

Edit: I tried the same with 200000 records of a length of 1000 characters, here are the results:

hash index on value being the fastest: 18k TPS
b-tree index on digest (current implementation): 15k TPS
b-tree index on value the slowest: 6k TPS