Pulpcore Telemetry and Katello

Hello community,

Pulpcore has a new telemetry feature that’ll be usable with Pulpcore 3.21. We’re slated to use Pulpcore 3.21 with Katello 4.7. I wanted to gather some feedback about how folks feel about the Pulpcore telemetry.

The data collected is quite simple: what Pulpcore plugins are in use, what Pulpcore “content applications” are running, and what Pulpcore workers are running.

Here are some more links with more info:

And here’s the data in action: https://analytics.pulpproject.org/

Currently, we have a PR out that will make the Pulpcore Telemetry opt-in with Katello. To enable it, you would need to use custom hiera to turn on the Telemetry variable. Few people will likely do so, however. So to help Pulp get more data to improve their software, I’m suggesting that we eventually move to an opt-out model and give plenty of warning and information about how to opt out, perhaps as a warning after the installer runs. I’m not sure how we could make it more obvious than that, so I’d like to gather some suggestions.

I proactively made an installer issue for opt-out.

The important thing to take away from this is that we won’t enable Pulpcore Telemetry by default until we:

  1. Decide that it’s even a good idea, with help from the Foreman community, and
  2. Have a proper solution for making it obvious that some data will be uploaded to Pulp.

I appreciate asking the community, but should this be moved to the community section for better visibility?

Sure thing, that sounds like a better spot.

While I like supporting projects with data, most customers dislike it, so I would prefer opt-in. Always having to remember to set another installer option can be “annoying”.

There is also sometimes the problem that environments are very restrictive regarding internet connections. So it will likely not work, or even worse, the network team will complain to the operations team about regular connection attempts being blocked, and you then need to find and remove the unwanted connection, which leads to a bad experience.

So I’m not totally against an opt-out, since you say you already plan to make it very visible, but I always consider opt-in the nicer way for a project to treat its users.

Appreciate the input, @Dirk. Having a second installer run just for opting out may not be the best user experience either.

If we go down the opt-in route, we could consider suggesting the telemetry installer flag in the documentation for basic installations. I’d be curious to hear other ideas of advertising it well.

Hi, I’m Brian, a Pulp developer involved with https://analytics.pulpproject.org. I’d like to ask Katello not to disable telemetry by default, and instead do two things:

  1. Make it clear that it’s enabled by default.
  2. Make it easy to opt-out of.

Pulp relies on this project data to make decisions that are ultimately and entirely for the benefit of users like Katello users. Some users don’t want to send telemetry, or can’t for a variety of reasons, and I want those users to be able to turn it off easily. My belief is that the on-by-default strategy creates the most value for the most users while still giving full knowledge to users who want to make a different choice.

How can this be made on-by-default and really easy to turn off in the installer?

Here’s an example that I hope helps share my perspective on the need here.

Typically Pulp launches new APIs as “tech-preview”, which means they are not governed by Pulp’s semver API policy yet and therefore could have breaking changes made to them at any point. Currently we have no data-driven decision making around when to remove this label, so Pulp as a project tends to just leave things as tech preview for very long periods of time. One example use of telemetry is guiding decisions like when to remove the tech preview label from API Foo, which has a lot to do with knowing how much usage that API is actually receiving relative to the bug reports received for API Foo.

Say we make an API feature Foo requested by Katello and it’s included in Katello. With Telemetry off by default, I believe we will receive data that says API Foo is hardly used at all (just by those users who opted in). That data suggests Pulp should keep API Foo in tech-preview for a very-long-if-not-forever kind of timeline. If Telemetry is on by default, I believe we’ll have much more accurate usage data. This can lead to outcomes where we can likely remove the tech preview label from API Foo once we observe that it’s being used at scale yet no significant bug reports are coming in related to that API.

I’ve held back on responding here since I’m not really a Pulp user, though I’ll say Foreman traditionally had a policy where we never sent telemetry. This isn’t quite true, since by default we retrieve the RSS feed on a regular basis and that has the Foreman version (x.y, not x.y.z) in the User-Agent header and obviously a request IP address will be used.

Then Puppetserver 7.0 has also started to send telemetry by default (see “Submitting usage telemetry”). The installer has --puppet-server-puppetserver-telemetry as an option to turn this off, but this discussion inspired me to open theforeman/puppet-puppet pull request #851 (“Fixes #35728 - Disable telemetry by default”) to get back in line.

Today it’s off by default, and you can pass --foreman-proxy-content-pulpcore-telemetry true/false to choose the behavior. Note that this can be passed on the initial installation, just like any option.

Fun fact: with Puppetserver I noticed it couldn’t upload from my IPv6-only installation since the Puppet endpoint was IPv4-only. I just checked and analytics.pulpproject.org is also IPv4-only. So the lesson is: deploying your hosts IPv6-only is the best way to avoid telemetry.

As a Katello dev that works with Pulp a lot, I want as many users as possible to participate in Pulpcore Telemetry. I also get that many admins prefer the respect of not forcing Telemetry uploads in their environments.

Turning telemetry on or off via the Foreman Installer sounds inconvenient. We could have a copy/paste suggestion in the docs with a preset --foreman-proxy-content-pulpcore-telemetry true/false, but many power users probably won’t even notice those docs and will just run foreman-installer in whatever way they’re used to.

So, would it be possible for Pulpcore Telemetry to be configurable at run time? If it is, then Katello could have something in the UI that makes it very easy to enable or disable Pulpcore Telemetry. @bmbouter, I’m not sure how technically challenging this would be for Pulp.

We could even potentially go so far as to have a modal that pops up after an upgrade or a fresh install to ask for a decision about the telemetry.

Pulp’s settings are interpreted at startup, so the setting needs to be changed and Pulp then restarted. Given that, using the installer option and having users rerun it seems the most viable approach to me. I don’t know much about the Katello installer though, so that’s my naive suggestion.

Would it at all be possible for Telemetry to be a special setting that is settable via the API? That’s the only way I could see a UI setting being possible. I’m worried that we won’t find a suitable solution with the installer alone.

If we do decide to make it opt-out, I wonder if it needs to be included in the privacy policy.

The persona who decides on telemetry is the admin, and admins generally set settings via dynaconf, not the API. Generally across Pulp installations, the API user of Pulp and the Pulp admin are not the same persona.

Since the installer option was used for this type of user decision before, can we do what worked there again here?

I think adding something to the privacy policy would make sense.

We definitely can, but that would mean falling back on using only the Foreman Installer. The installer doesn’t have a “connection” to the UI, so we cannot provide a nice modal/form in the UI asking to opt in or out (at least that’s my understanding).

If making an API call to Pulp to change the setting is not possible, @ekohl or @wbclark could we have a one-time check in the installer that requires a user to make a decision about the Telemetry?

If we choose the opt-out route, the installer would need to warn the user, so I feel like implementing that feature would get us close to having a forced decision anyway.

The Foreman Installer would remember the user’s choice, so they would only need to set it once.

As long as the TELEMETRY setting is configured in settings.py it must be managed by the installer. The user could manually edit settings.py and restart pulp services (if they are careful to do so at a time when nothing is happening) but the next installer run would overwrite their changes.
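For illustration, the relevant fragment of settings.py might look roughly like the sketch below. It assumes TELEMETRY is a plain boolean, as the name used above suggests; the exact contents of the installer-managed file are not shown in this thread.

```python
# Sketch of the relevant fragment of Pulp's settings.py (assumed layout).
# The installer rewrites this file on every run, so a manual edit here
# only lasts until the next foreman-installer run.
TELEMETRY = False  # assumed boolean; False is taken to mean "do not send telemetry"
```

Because settings are read at startup, a change like this would only take effect after the Pulp services are restarted.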

The installer has an interactive mode (although I don’t think I have heard of anybody actually using it). I don’t think that pausing the installer to wait for user input in the default (non-interactive) mode is something that we’d want to do. Not only would this break user expectations and the difference between interactive and default installer modes, but there is also tons of tooling that wraps the installer and expects non-interactive execution, whether that’s in CI, foreman-maintain, forklift, foreman operations collection, end user scripts, and so on…

The installer has a mechanism for remembering the user’s choice, and that mechanism is the scenario answers / hiera data. Basically, to determine what value to use for a given parameter, the installer will try lookups from various sources according to priority rules defined here: foreman-installer/foreman-hiera.yaml at 8362b7259643bc2191595043b2f25a46a91ac73e · theforeman/foreman-installer · GitHub (if those data sources are exhausted and a value for the parameter is still not found, then the default value from the relevant puppet module will be used)
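To make the lookup order concrete, here is a minimal Python sketch of a "first source that defines the key wins, otherwise use the module default" lookup. It is not the installer's actual code: the function name, the layer names, and their ordering are hypothetical, and only the key name is taken from this thread.

```python
from typing import Any, Mapping, Sequence

_MISSING = object()  # sentinel so an explicit False is not mistaken for "unset"


def lookup(key: str, layers: Sequence[Mapping[str, Any]], module_default: Any) -> Any:
    """Return the value from the first layer that defines key, else the module default."""
    for layer in layers:
        value = layer.get(key, _MISSING)
        if value is not _MISSING:
            return value
    return module_default


KEY = "foreman_proxy_content::pulpcore_telemetry"

# Hypothetical data sources, highest priority first.
custom_data = {KEY: True}
scenario_answers: dict = {}

print(lookup(KEY, [custom_data, scenario_answers], module_default=False))
print(lookup(KEY, [{}, {}], module_default=False))
```

The sentinel matters here because the parameter is a boolean: a source that explicitly sets false has to stop the search rather than fall through to the next source.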

The current behavior is that the puppet-foreman_proxy_content module has a $pulpcore_telemetry parameter with default value false. My issue with changing this default to true is that telemetry data would be sent without the user’s consent until the user explicitly opts out by running foreman-installer --foreman-proxy-content-pulpcore-telemetry false. This is compounded by the fact that running the installer can mean 20 minutes of downtime, or sometimes even longer, for all Katello services, per Katello server and per content proxy.

So, it is easy to imagine the frustration of a user who is running 30 content proxies and does not consent to providing telemetry data (or whose organizational policy prohibits it), and who must take significant downtime on all instances to disable it.

I also understand the point of view that, for the same reason, users would be disincentivized to opt-in to providing telemetry data.

So as a compromise to try to best satisfy all of the competing constraints that have been laid out in the discussion so far, what if we take the following approach:

  1. Continue to have telemetry disabled by default
  2. Pulp provides an API to check whether telemetry is enabled
  3. Katello uses this API to check whether telemetry is enabled, and if it is not, it provides a nice notification explaining how helpful this data is and asking the admin to please consider enabling it
  4. The admin can decline, and check a box that says “do not prompt me about this again” (or decline for now and be reminded again later)
  5. The admin can opt-in to telemetry, in which case they get a REX job that creates a drop-in file with the hiera data foreman_proxy_content::pulpcore_telemetry: true
  6. The configuration will take effect automatically whenever the installer next runs (whenever they next update or upgrade, or use the installer to modify any other infrastructure configuration)
  7. The REX job could in theory also immediately edit settings.py to match the value that the installer will eventually provide, so that the configuration takes effect at the next restart of pulp services, while also surviving future installer runs
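As a rough sketch of what the REX job in steps 5 to 7 would have to produce, the Python below renders the one-line hiera data and writes it to a file. The function names are hypothetical, and the target path is deliberately left to the caller, since where the drop-in file should live is not settled in this thread.

```python
from pathlib import Path

HIERA_KEY = "foreman_proxy_content::pulpcore_telemetry"


def render_dropin(enabled: bool) -> str:
    """Render one line of YAML hiera data for the telemetry parameter."""
    return f"{HIERA_KEY}: {'true' if enabled else 'false'}\n"


def write_dropin(path: Path, enabled: bool) -> None:
    """Write the drop-in file; the caller decides where it lives."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_dropin(enabled), encoding="utf-8")
```

Writing the file alone changes nothing until the installer next runs, which is the point of step 6.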

Thoughts / feedback on this design, @iballou @bmbouter @ekohl and any other interested parties?

I appreciate the thought you’ve put into this, but the compromise proposal has some problems. One non-starter is that I don’t believe Pulp will provide these APIs. It also doesn’t speak to my core concern, which is that Katello would be way underrepresented in Pulp project decision making by defaulting to off.

Given what I’ve written so far, it might surprise you to learn that I turn off telemetry on FOSS software I use. I get that as a personal choice. What I’m asking all of us to do is to set aside personal beliefs and ask ourselves, what is best for the broad base of users.

My belief is that most users don’t care about anonymous collection of data such as what version they are running. I think that’s why, even though Katello must have tens of thousands of users, I’ve only read one user reply so far. Also, the folks who believe strongly that data privacy is the top concern tend to engage in these discussions most strongly, which creates the impression that theirs is the “majority perspective”.

My claim is that users (both Pulp and Katello) care a lot more about having Pulp make project decisions that don’t negatively impact them. If data privacy of anonymous data is a concern, users should read the release notes and use the installer option to disable it. Is that really such a bad situation?

As a big user of Pulp, I feel we (Katello) have a responsibility to help Pulp improve their software. Given that we haven’t had much user feedback, and the only feedback so far (thanks Dirk!) has indicated it being an annoyance at worst, I’m leaning towards going opt-out and seeing if the community speaks up then. As Pulp adds more interesting data to collect, the benefits to them (and subsequently to us) could be tremendous.

Let’s say we go the opt-out route. Here are the places that I think we would need to stick information:

  1. Docs install + upgrade guides
  2. Katello release announcement
  3. Release notes
  4. Privacy policy (with a revision of the 3rd party data statement)
  5. A note in the UI would be awesome… perhaps we could also update a Katello setting via the installer to tell us to show an informational banner about the telemetry collection
    • Maybe this is overkill?

Important:
We would also need regular updates to the community about changes in the data being collected. This would need to be announced with every new Pulp release that gets packaged with Katello. Katello meets with Pulp regularly, so we would be able to stay on top of this.

If users are sufficiently warned to turn off Telemetry at install or upgrade time, they should not need to ever have a second Foreman Installer run. The only impact will be remembering to add the installer argument.

@bmbouter what about a flag on the status endpoint that tells the telemetry status?
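To show what I mean, here is a small Python sketch of how a client could read such a flag from an already-decoded status response. The "telemetry" key is purely hypothetical; this assumes Pulp would add it, which nothing in this thread confirms.

```python
from typing import Any, Mapping, Optional


def telemetry_enabled(status: Mapping[str, Any]) -> Optional[bool]:
    """Read a hypothetical 'telemetry' flag from a decoded status payload.

    Returns None when the flag is absent or not a boolean, so callers can
    tell "unknown" apart from "disabled".
    """
    flag = status.get("telemetry")
    return flag if isinstance(flag, bool) else None
```

Returning None for a missing flag would let Katello stay quiet when talking to a Pulp version that does not report the setting at all.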

I like @wbclark’s idea about having a way to enable or disable it via the UI. That sounds more up to “modern web application” standards. The use of Foreman Remote Execution removes the need for a “setter” on the Pulp API.

I think if you consider it from a non-Katello perspective it doesn’t make sense. The API user and the admin in Pulp are different personas. Users can’t act on that data.