Scoped search vs elastic performance comparison

Hello All,

As part of moving more Katello entities to scoped search I did some
performance comparisons between a nightly install, which uses scoped
search for errata on postgresql, and katello 2.0, which uses
elasticsearch.

Both systems were using 6 GB ram, with 2 cpus each on a vm running on
the same host. I loaded 200,000 fake errata into the database with a
bz, cve, and three packages each and created 5 repositories with 20,000
errata each. I then wrote a few different queries that users would
likely perform, returning the total count and 20 items for each query,
and executed them 10 times, averaging the results (time in seconds):

Scoped Search:

Errata: 0.0041088655
Errata count: 0.1569917291
Errata type filter: 0.004434209000000001
Errata type filter count: 0.0766710245
Errata type search: 0.0042759371
Errata type search count: 0.07862921749999999
Errata Package name: 0.39945394970000003
Errata Package name count: 0.38841846230000004

Elastic Search:

Errata: 0.07552989400000001
Errata count: 0.06409571280000001
Errata type filter: 0.0291230768
Errata type filter count: 0.06429808210000001
Errata type search: 0.0739272778
Errata type search count: 0.018988432399999998
Package name: 0.1048855049
Package name count: 0.0442273977
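
To make the two lists easier to compare side by side, here is a small
Ruby sketch that divides each scoped search average by the matching
elasticsearch average. The hash names and the slowdown helper are made up
for illustration, and the scoped search "Errata Package name" rows are
keyed as "Package name" so the two lists line up.

```ruby
# Averages in seconds, copied from the two lists above.
SCOPED = {
  "Errata"                   => 0.0041088655,
  "Errata count"             => 0.1569917291,
  "Errata type filter"       => 0.004434209000000001,
  "Errata type filter count" => 0.0766710245,
  "Errata type search"       => 0.0042759371,
  "Errata type search count" => 0.07862921749999999,
  "Package name"             => 0.39945394970000003,
  "Package name count"       => 0.38841846230000004
}.freeze

ELASTIC = {
  "Errata"                   => 0.07552989400000001,
  "Errata count"             => 0.06409571280000001,
  "Errata type filter"       => 0.0291230768,
  "Errata type filter count" => 0.06429808210000001,
  "Errata type search"       => 0.0739272778,
  "Errata type search count" => 0.018988432399999998,
  "Package name"             => 0.1048855049,
  "Package name count"       => 0.0442273977
}.freeze

# Hypothetical helper: ratio of scoped search time to elasticsearch time.
# A ratio above 1 means scoped search was slower for that query.
def slowdown(scoped, elastic)
  scoped.each_with_object({}) do |(name, secs), out|
    out[name] = secs / elastic.fetch(name)
  end
end

slowdown(SCOPED, ELASTIC).each do |name, ratio|
  puts format("%-26s %6.2fx", name, ratio)
end
```

Ratios below 1 are the queries where scoped search came out ahead (the
plain 20-item fetches); ratios above 1 are the counts and the package
name searches.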

Note that initially the scoped search queries were much slower and
required a good deal of optimization, mostly adding various indexes.
Prior to the optimization, most of the scoped search queries were in the
0.5s to 1s range. Indexes already existed but were not adequate to
achieve these final performance numbers. Because scoped search lets the
user search on many different columns, the number of indexes required
would also increase, and it may be difficult or impossible to provide
this level of performance for all user queries. I don't think this is
too terrible, as the default queries used throughout the app should be
able to be optimized, and detailed user queries being a bit slower would
be acceptable. Also note that very little postgresql server optimization
was done other than bumping shared_buffers & effective_cache_size in
postgresql.conf to around 200MB. These could be increased further and
further optimizations could be performed.
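
For reference, the postgresql.conf change described above would look
roughly like the fragment below. The values are the approximate ones
mentioned; everything else is assumed to be left at its default.

```conf
# postgresql.conf: only the two settings mentioned above, values approximate
shared_buffers = 200MB          # changing this requires a server restart
effective_cache_size = 200MB    # planner hint only, does not allocate memory
```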

No optimizations were done to elasticsearch.

Conclusion:

Scoped search is sufficient for our needs today for katello entities and
entities purely in backend systems (such as packages and errata).
Looking to the future if we aim to scale to a million or more hosts (for
example), we likely would want to consider more loosely integrating
elasticsearch in an optional manner for just entities that need it if
postgresql fails to perform well enough.

Let me know if you have any questions.

-Justin

PS

The code for the queries themselves is:

def scope_tests
  [
    time("Errata"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).order('updated').limit(20).all},
    time("Errata count"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).count(:distinct => true)},
    time("Errata type filter"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).where(:errata_type => :security).order('updated').limit(20).all},
    time("Errata type filter count"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).where(:errata_type => :security).count(:distinct => true)},
    time("Errata type search"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).search_for("type = security").order('updated').limit(20).all},
    time("Errata type search count"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).search_for("type = security").count(:distinct => true)},
    time("Package name"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).search_for("package_name ~ a*").order('updated').limit(20).all},
    time("Package name count"){Katello::Erratum.in_repositories(Katello::Repository.all[0..3]).search_for("package_name ~ a*").count(:distinct => true)}
  ]
end

def es_tests
  [
    time("Errata"){ Katello::Errata.search{ size 20; query{all}; filter :and, [{:terms => {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}]} },
    time("Errata count"){ Katello::Errata.search{ query{all}; filter :and, [{:terms => {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}]}.total },
    time("Errata type filter"){ Katello::Errata.search{ size 20; query{all}; filter :and, [{:terms => {:type => [:security]}}, {:terms => {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}]} },
    time("Errata type filter count"){ Katello::Errata.search{ query{all}; filter :and, [{:terms => {:type => [:security]}}, {:terms => {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}]}.total },
    time("Errata type search"){ Katello::Errata.search{ size 20; query{string 'type:security'}; filter :terms, {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}} },
    time("Errata type search count"){ Katello::Errata.search{ query{string 'type:security'}; filter :terms, {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}.total },
    time("Package name"){ Katello::Errata.search{ size 20; query{string 'pkglist.packages.name:*a*'}; filter :terms, {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}} },
    time("Package name count"){ Katello::Errata.search{ query{string 'pkglist.packages.name:*a*'}; filter :terms, {:repoids => Katello::Repository.pluck(:pulp_id)[0..2]}}.total }
  ]
end
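
The time helper called in both methods is not included in the post. A
minimal sketch of what it could look like, assuming it runs the block 10
times and returns the label together with the mean wall-clock time in
seconds (the signature and return shape here are guesses, not the
original code):

```ruby
require 'benchmark'

# Hypothetical stand-in for the `time` helper used in the test methods.
# Runs the given block `runs` times and returns [label, mean seconds].
def time(label, runs = 10)
  total = 0.0
  runs.times do
    # Benchmark.realtime returns the elapsed wall-clock time as a Float.
    total += Benchmark.realtime { yield }
  end
  [label, total / runs]
end

# Example with a cheap stand-in block instead of a real query:
label, seconds = time("noop") { (1..1000).reduce(:+) }
puts "#{label}: #{seconds}"
```

With a helper of this shape, printing a whole run would be something like
scope_tests.each { |label, seconds| puts "#{label}: #{seconds}" }.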

thumbs up!

On 06.11.14 20:34, Justin Sherrill wrote:
> [...]

Hey, nice report, thanks!

> Both systems were using 6 GB ram, with 2 cpus each on a vm running on the
> same host. I loaded 200,000 fake errata into the database with a bz, cve,
> and three packages each and created 5 repositories with 20,000 errata each.
> I then wrote a few different queries that users would likely perform
> returning the total count and 20 items for each query and executed them 10
> times averaging the results (time in seconds):

I don't understand the sizing of 3 packages per erratum and 20,000
errata per repository. Is this a real-world setup? I'd expect something
like 20,000 packages and 2,000 errata.

> Note that initially the scoped search queries were much much slower and
> required a good deal of optimization adding various indexes. Prior to the
> optimization, most of the scope search queries were in the .5s to 1s range.
> Indexes already existed but were not adequate to achieve these final

Can you link the indexes patch PR?

Were you patching the scoped_search gem itself?

> Scoped search is sufficient for our needs today for katello entities and
> entities purely in backend systems (such as packages and errata). Looking
> to the future if we aim to scale to a million or more hosts (for example),
> we likely would want to consider more loosely integrating elasticsearch in
> an optional manner for just entities that need it if postgresql fails to
> perform well enough.

Can you give a little bit more context? Do I understand it right that
Katello is dropping Elasticsearch for now while you think you will
eventually adopt it again?

With that lot of work done, I wonder if you want to carry on a bit with
investigating PostgreSQL optimization & clustering options as well. For
example, a dedicated psql replica tuned for scoped_search could give it
another boost in millions-of-hosts scenarios. I am not sure how far we
can get with postgresql tuning in this scenario, as you have already
improved it with indexes.

--
Later,
Lukas #lzap Zapletal