Repositories and Sources¶
Repositories contain package tar files and are the primary vehicle for organizing and distributing packages. For more information on packages and repositories see the Package Ecosystem section.
In RStudio Package Manager, repositories are created from one or more sources. The documentation in this chapter outlines repositories as well as the types and structure of sources.
Repository types¶
RSPM supports repositories of three types:
- R - A CRAN-like repository for R packages. This is the default repository type.
- Bioconductor - A repository for Bioconductor R packages that supports use by BiocManager.
- Python - A Python repository with a source that mirrors the Python Package Index (PyPI).
Repository Structure¶
Package repositories have a specific structure that enables client commands like install.packages, pip install, and BiocManager::install to query the repository's contents and download packages.
A conventional repository is just a set of files served from disk. RStudio Package Manager does not create repositories on disk. Instead, it maintains a single copy of each source and binary package, and uses a database and a specialized web server to handle HTTP requests from R and Python clients.
Some example requests that can be served by the RStudio Package Manager:
PACKAGES file¶
Terminal
http://pkg-manager.example.com/repo/latest/src/contrib/PACKAGES
This serves a PACKAGES file. The PACKAGES file for a repository is human-readable and contains information on each package available in the repository. RStudio Package Manager can also serve requests for PACKAGES.gz and PACKAGES.rds.
Package Source¶
Terminal
http://pkg-manager.example.com/repo/latest/src/contrib/package_2.1.0.tar.gz
This request downloads the package source to the client.
Bioconductor Package Source¶
Terminal
http://pkg-manager.example.com/repo/packages/3.11/bioc/src/contrib/package_2.1.0.tar.gz
This request downloads the Bioconductor package source to the BiocManager client.
Archived Package Source¶
Terminal
http://pkg-manager.example.com/repo/latest/src/contrib/archive/package/package_1.1.0.tar.gz
This request downloads the tar file for an older, archived version of the package.
Most importantly, an RStudio Package Manager repository is a CRAN-like repository, which means users can access and install packages using their regular R functions and tools: install.packages, available.packages, packrat, and devtools::install.
Simple page¶
Terminal
http://pkg-manager.example.com/repo/latest/simple/PACKAGE
The simple page is a PEP 503 and PEP 592 compliant endpoint that serves the PyPI Simple Repository API.
PyPI Package Source¶
Terminal
http://pkg-manager.example.com/repo/latest/packages/ID/PACKAGE#sha256=SHA256
This request downloads the PyPI package source to the client. Links to these files can be found by visiting a package's simple page.
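The request shapes above can be summarized in a short Python sketch. The helper names are hypothetical, and the host is the example host used in this section; the code only assembles strings in the layouts shown above and is not part of RStudio Package Manager.

```python
# Illustrative helpers only: the function names are hypothetical, and the
# URL layouts are copied from the example requests in this section.
BASE = "http://pkg-manager.example.com"


def packages_index_url(repo, snapshot="latest"):
    """URL of the human-readable PACKAGES index for an R repository."""
    return f"{BASE}/{repo}/{snapshot}/src/contrib/PACKAGES"


def source_url(repo, package, version, snapshot="latest"):
    """URL of the current source tar file for a package."""
    return f"{BASE}/{repo}/{snapshot}/src/contrib/{package}_{version}.tar.gz"


def archive_url(repo, package, version, snapshot="latest"):
    """URL of an older, archived source tar file for a package."""
    return (
        f"{BASE}/{repo}/{snapshot}/src/contrib/archive/"
        f"{package}/{package}_{version}.tar.gz"
    )


def simple_page_url(repo, package, snapshot="latest"):
    """URL of the PyPI-style simple page for a Python package."""
    return f"{BASE}/{repo}/{snapshot}/simple/{package}"
```

Keeping the snapshot component as a parameter that defaults to "latest" mirrors how every example URL in this section contains /latest.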
Repository Versioning¶
RStudio Package Manager tracks every change to a repository (or source) and associates each change with a snapshot. Together, these snapshots create a full versioned history of each repository. If a user wants to install packages from a prior point in the repository's history, they can do so by replacing the /latest component of the request URL with a snapshot id. Snapshot ids can be obtained from a repository's "Activity" log. The current snapshot id is available on the "Setup" page.
Note
For projects that require strict reproducibility, we recommend configuring R to use a repository URL with a snapshot id.
Versioning is available for all repository and source types except Bioconductor, since Bioconductor includes its own versioning scheme.
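As a rough illustration of the URL substitution described above, the following Python sketch swaps the /latest path component for a snapshot id. The function name is hypothetical, and the id "123" in the usage line is a placeholder; real snapshot ids should be taken from the "Activity" log or the "Setup" page.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_to_snapshot(repo_url, snapshot_id):
    """Replace the 'latest' path component of a repository URL with a snapshot id.

    Hypothetical helper: it assumes the repository itself is not named
    'latest', and it replaces only the first matching path segment.
    """
    parts = urlsplit(repo_url)
    segments = parts.path.split("/")
    if "latest" not in segments:
        raise ValueError("URL has no /latest component to replace")
    segments[segments.index("latest")] = str(snapshot_id)
    return urlunsplit(parts._replace(path="/".join(segments)))


# "123" is a placeholder, not a real snapshot id.
print(pin_to_snapshot("http://pkg-manager.example.com/repo/latest", "123"))
```

The resulting URL can then be used wherever the /latest URL would otherwise be configured.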
Sources¶
About Sources¶
RStudio Package Manager repositories are composed of one or more sources. There are currently seven types of sources:
- cran source - A single cran source is automatically created. This source contains metadata and packages from RStudio's CRAN service. The source can be used directly in a repository to give users access to all CRAN packages, or it can be used indirectly by curated-cran sources.
- pypi source - A single PyPI source is automatically created. This source contains metadata and packages from RStudio's PyPI service. The source can be used directly in a Python repository to give users access to all PyPI packages.
- bioconductor source - One bioconductor source per Bioconductor version is automatically created when a Bioconductor repo is present. If no Bioconductor repo is present, individual Bioconductor sources (one per Bioconductor version) can be created manually. Bioconductor sources can be used directly by R (CRAN-like) repos, and Bioconductor repos use all Bioconductor sources automatically.
- curated-cran source - A curated CRAN source allows administrators to specify sets of approved CRAN packages. Administrators can add or remove packages from the set, and they can also update the set. See the Curated CRAN Source section for more information.
- cran-snapshot source - A CRAN source that is pinned to a specific CRAN snapshot. Administrators can periodically update the snapshot to which the cran-snapshot source is pinned. See the CRAN Snapshot Source section for more information.
- local source - A local source is used as a mechanism to distribute locally developed packages or other packages without native support in RStudio Package Manager. Administrators add packages to local sources by specifying a path to a package's tar file.
- git source - A git source allows RStudio Package Manager to automatically make packages in Git available to R users through install.packages (without requiring devtools). Git sources work for internal packages as well as external sites such as GitHub. Packages can be automatically updated on each commit or when a new Git tag is pushed.
Note
While the CRAN and PyPI sources are created automatically, an administrator must use the CLI before any metadata or packages are downloaded to RStudio Package Manager. See the CLI section for more information on making CRAN available through RStudio Package Manager.
Repositories with Multiple Sources¶
A repository can have more than one source. If you wish to serve both local packages and CRAN packages from a single repository, you can create a single repository that subscribes to multiple sources. For example:
- all (a repository)
  - internal (local source)
  - cran (CRAN source)
The "all" repository above gives users access to both local and CRAN packages, and its PACKAGES list could be accessed, for example, at https://packagemanager.rstudio.com/all/latest/src/contrib/PACKAGES. A repository subscribes to sources, which means that changes to a source will be reflected in the repository. For example, if an administrator adds a new package to the internal source, users will automatically be able to access the new package via the all repository.
Package Conflicts Between Sources¶
If a repository has multiple sources and a package with the same name exists in more than one of them, RStudio Package Manager eliminates duplicates, giving preference in the order the sources are subscribed. In the example repository above, if a package named "plumber" exists in both the "cran" and "internal" sources, the "plumber" package from the "internal" source would be served and listed since it is the first source for the repository.
The same conflict resolution occurs as sources change. For example, in the sample above, even if a new package is added to CRAN with the same name as an internal package, the internal package will continue to be served. The precedence is also maintained during updates. In the example above, the internal version of plumber will continue to be served even if the CRAN version of plumber is updated. The order of sources within a repository can be re-arranged using the reorder command.
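The precedence rule can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the product's implementation: the function name is hypothetical, and the version numbers in the usage example are made up for illustration.

```python
def resolve_packages(sources):
    """Merge package listings, keeping the first source that provides each name.

    `sources` is a list of (source_name, {package: version}) pairs given in
    subscription order, so earlier entries take precedence.
    """
    served = {}
    for source_name, packages in sources:
        for package, version in packages.items():
            # setdefault leaves an existing entry untouched, so a package
            # already claimed by an earlier source is never overridden.
            served.setdefault(package, (source_name, version))
    return served


# Version numbers below are illustrative only.
repo_all = [
    ("internal", {"plumber": "0.1.0"}),
    ("cran", {"plumber": "1.2.1", "dplyr": "1.1.4"}),
]
print(resolve_packages(repo_all))
```

Because the rule depends only on subscription order, updating the CRAN copy of plumber in this model changes nothing about which copy is served; reordering the list does.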
What is RStudio's Package Service?¶
RStudio Package Manager doesn't download packages directly from CRAN, Bioconductor, or PyPI. Instead, RStudio maintains a curated S3 bucket that contains metadata about CRAN, Bioconductor, and PyPI, in addition to package tar files. The metadata is used to track day-to-day changes.
See the Air-Gapped RStudio Package Manager section if your environment does not have access to the RStudio Package Service.
During a sync, the metadata is downloaded to RStudio Package Manager. The metadata is compared against the RStudio Package Manager database to determine what changes need to be applied. Package tarballs are then downloaded to the cache on demand.
See the Package Security section for details about the security measures that are in place for the RStudio Package Service.
Publishing Snapshots to the RStudio Package Service¶
We evaluate CRAN, Bioconductor, and PyPI each business day and publish new snapshots when updates are available. Then, any RSPM installations sync these snapshots based on their configured schedules. For example, suppose a CRAN package gets updated on Saturday. We will publish a new snapshot to the RStudio Package Service sometime on Monday, usually Monday afternoon. Then, the user's default CRAN sync will pick up the change on Tuesday at 12:00am. Occasionally, if there are very important updates, we generate an extra snapshot to make the updates available sooner. The timing of each snapshot varies based on the number of changes and the number of dependencies involved.
To be sure RStudio Package Manager synchronizes new snapshots as soon as possible, consider setting your sync schedule to occur more than once per day.
Why is the newest package version from CRAN not available yet?
We currently evaluate CRAN, Bioconductor, and PyPI for updates each business day. However, this schedule is subject to change without notice. Large updates, external repository errors or inconsistencies, and other unanticipated situations may cause delays.
Package Fetching¶
RStudio Package Manager fetches packages on-demand as they are requested by end users. Package Manager will still download the CRAN, Bioconductor, and PyPI metadata on the sync schedule to keep the RStudio Package Manager database updated. The database serves as the source of truth for package availability. The benefit of on-demand fetching is a smaller footprint in terms of network bandwidth and disk space.
Package Caching¶
RStudio Package Manager downloads each version of a package only once, and always checks the local cache to see if the required tar file is already available.
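A minimal Python sketch of on-demand fetching with a download-once cache follows. The class, its attributes, and the fetch callback are all hypothetical stand-ins; the real cache lives on disk and is managed by RStudio Package Manager.

```python
class PackageCache:
    """Toy model of on-demand fetching: each (name, version) is fetched once."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable(name, version) -> bytes that
        # stands in for downloading a tar file from the package service.
        self._fetch = fetch
        self._store = {}      # (name, version) -> tar file contents
        self.downloads = 0    # how many times `fetch` was actually called

    def get(self, name, version):
        key = (name, version)
        if key not in self._store:
            # Cache miss: download now, on the first client request.
            self._store[key] = self._fetch(name, version)
            self.downloads += 1
        return self._store[key]


cache = PackageCache(lambda name, version: f"{name}_{version}.tar.gz".encode())
cache.get("package", "2.1.0")
cache.get("package", "2.1.0")   # served from the cache, no second download
print(cache.downloads)
```

Nothing is downloaded until a client asks for it, which is where the bandwidth and disk savings described above come from.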
Synchronizing with the RStudio Package Service¶
The CRAN, Bioconductor, and PyPI sources are synchronized according to a schedule set using the SyncSchedule properties in the RStudio Package Manager configuration file. These properties accept a string in crontab format. See the appropriate appendix section below.
By default, the configuration file includes crontabs that will cause RStudio Package Manager to sync once a day (early morning in the server's timezone), if any of the following conditions have been met:
- For CRAN:
  - Any repository subscribes to the cran source.
  - A "curated-cran" source is used by any repo.
  - A manual sync has been run with the sync --type=cran command.
- For Bioconductor:
  - Any repository subscribes to a Bioconductor source.
  - A Bioconductor repo has been created.
- For PyPI:
  - Any repository subscribes to the pypi source.
  - A manual sync has been run with the sync --type=pypi command.
A sync schedule will not be applied if the above conditions are not met. If you only want manual syncs, change the configuration file to have a blank value for SyncSchedule:
; /etc/rstudio-pm/rstudio-pm.gcfg
[CRAN]
SyncSchedule = ""

[Bioconductor]
SyncSchedule = ""

[PyPI]
SyncSchedule = ""
Note
For more information on setting the PyPI schedule, see the PyPI schedule section.
The SyncSchedule property does not necessarily determine when a repository will make updated packages available to users. For example, if a repository subscribes directly to the cran source, users will see updates according to the sync schedule. In contrast, if the repository subscribes to a curated CRAN source, an administrator must explicitly update the source in order for updates to become available.
In addition, updating the repository does not automatically push updated packages to R clients. A repository specifies what packages are available, but the R user is in control of when and how to update the packages used by a project.
See the section on Managing Change Control for more information.
RStudio Package Manager keeps track of old versions of packages as well. Old versions of packages are available in the repository's archive, and are listed in the RStudio Package Manager web UI. This allows users to roll back updates if necessary or install packages as they existed at a prior time.
Note
Source snapshots are only retrieved while your server is running and according to your configured SyncSchedule, which can cause delays between a new snapshot becoming available and your server downloading it. If you need the latest snapshots as soon as possible, we recommend setting the SyncSchedule to run every few hours. For example, the crontab 0 */4 * * * would run every four hours.
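To check what a crontab string such as the one above implies, the following Python sketch expands the minute and hour fields into daily run times. It is a simplified reading of standard five-field crontab syntax (it ignores the day and month fields and handles only *, */n, a-b, and comma lists); it is not the parser RStudio Package Manager uses, and the function names are hypothetical.

```python
def expand_cron_field(field, low, high):
    """Expand one crontab field into the sorted list of values it matches.

    Supports '*', '*/n', 'a', 'a-b', 'a-b/n', and comma-separated lists.
    """
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


def daily_run_times(crontab):
    """Return HH:MM run times for one day, ignoring day and month fields."""
    minute, hour, *_ = crontab.split()
    return [
        f"{h:02d}:{m:02d}"
        for h in expand_cron_field(hour, 0, 23)
        for m in expand_cron_field(minute, 0, 59)
    ]


print(daily_run_times("0 */4 * * *"))
# ['00:00', '04:00', '08:00', '12:00', '16:00', '20:00']
```

Times are in the server's timezone, as with the default schedule.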
Source Type Details¶
The CRAN Source¶
A primary use case for RSPM is making packages in public repositories, like CRAN, available to enterprise users. Administrators can elect to make all of CRAN available, or to make only curated subsets of CRAN available.
Server log messages related to this component can be shown by enabling the sync region.
More information about activating log regions is in the configuration appendix Debug section.
Bioconductor Sources¶
Similar to CRAN, RSPM makes packages from Bioconductor available to enterprise users. Administrators can make all of Bioconductor available, or limit Bioconductor availability to specific Bioconductor versions. Bioconductor packages can be accessed via Bioconductor repos with BiocManager, or they can be accessed via R repos with install.packages.
RSPM supports Bioconductor versions 3.1 (for R 3.2) and greater.
The PyPI Source¶
Another popular use case for RSPM is making PyPI packages in public repositories available to enterprise users. Administrators can make all of PyPI available by following the instructions in the Python section.
Curated CRAN Sources¶
Curated CRAN sources allow administrators to create and update approved subsets of CRAN. The behavior is best explained in an example.
Assume that RStudio Package Manager has been configured to sync CRAN updates daily.
January 1st - An administrator creates a curated CRAN source and is given a list of desired packages.
January 2nd - The administrator runs the add command, supplying the list of desired packages. RStudio Package Manager identifies all of the required dependencies and creates a proposal. The proposal includes the set of packages to be added as well as information about each package, such as license type. This information can be used to facilitate an external review process.
January 15th - The proposal is approved. The administrator returns to RStudio Package Manager and runs the add command again with the transaction ID included in the proposal. The set of packages is added from CRAN as they existed on January 1st, the date the source was created.
January 20th - The administrator receives a request to add a new package to the set of approved packages. The admin runs the add command, supplying the new package as an argument. RStudio Package Manager creates a proposal using the version of CRAN as it existed on January 1st. To ensure compatibility between the packages in the source, RStudio Package Manager always adds packages by pulling from CRAN as it existed on the date to which the source is pinned. As before, if the proposal is accepted, the admin can commit the changes.
January 30th - Now the administrator gets a request to update the approved packages. In order to keep all packages consistent, the entire set is updated at once using the update command. Like the add command, the update command enumerates all the changes needed to update the set of packages from January 1st to January 30th.
February 1st - The proposal is approved and the administrator completes the update command by supplying the transaction ID included in the proposal from the initial update run. The set of packages is now tied to CRAN as of January 30th. Future add commands will use this pinned date until another update sequence occurs.
To summarize, curated CRAN sources allow admins to create a subset of CRAN at a point in time. Administrators can add packages to the subset from the same frozen point in time. Administrators can also update the subset to a newer point in time. Each change supports creating a proposal, followed by a confirmation run that applies the proposal.
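The propose-then-commit flow in the timeline above can be modeled with a toy Python class. Everything here is an assumption made for illustration: the class and method names are hypothetical, transaction IDs are modeled as simple integers, and the year in the usage example is arbitrary. The real workflow is carried out with the add and update CLI commands.

```python
from datetime import date


class CuratedSource:
    """Toy model of a curated CRAN source pinned to a point in time."""

    def __init__(self, created):
        self.pinned = created       # CRAN date that adds are resolved against
        self.packages = set()       # approved package names
        self._proposals = {}        # transaction id -> (kind, payload)
        self._next_id = 1

    def _propose(self, kind, payload):
        tx = self._next_id
        self._next_id += 1
        self._proposals[tx] = (kind, payload)
        return tx

    def propose_add(self, names):
        """First 'add' run: record a proposal, change nothing yet."""
        return self._propose("add", set(names))

    def propose_update(self, as_of):
        """First 'update' run: propose moving the pinned date forward."""
        return self._propose("update", as_of)

    def commit(self, tx):
        """Second run with the transaction id: apply the proposal."""
        kind, payload = self._proposals.pop(tx)
        if kind == "add":
            self.packages |= payload
        else:
            self.pinned = payload


source = CuratedSource(date(2024, 1, 1))          # the year is arbitrary
tx = source.propose_add(["plumber"])              # proposal for review
source.commit(tx)                                 # applied after approval
tx = source.propose_update(date(2024, 1, 30))
source.commit(tx)
print(source.pinned, sorted(source.packages))
```

The point of the two-step design is that nothing changes in the source between the proposal and the commit, which leaves room for an external review.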
Given a list of desired packages, RStudio Package Manager automatically determines the full set of dependencies and also tracks those dependencies over time. Admins can elect to include suggested dependencies or only required dependencies by using the include-suggests flag. During each update, older versions of packages are archived, ensuring that tools like packrat and RStudio Connect work seamlessly with the curated CRAN subset.
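Determining the full set of dependencies amounts to a transitive closure over the dependency graph. The Python sketch below shows one way to compute it. The function name, the dictionaries, and the package names are hypothetical, and following suggested dependencies recursively is an assumption of this sketch rather than documented behavior of the include-suggests flag.

```python
def dependency_closure(requested, requires, suggests=None, include_suggests=False):
    """Return every package needed to install `requested`.

    `requires` and `suggests` map a package name to the names it depends on.
    Suggested dependencies are followed only when `include_suggests` is True.
    """
    suggests = suggests or {}
    needed = set()
    stack = list(requested)
    while stack:
        package = stack.pop()
        if package in needed:
            continue  # already visited; also guards against cycles
        needed.add(package)
        stack.extend(requires.get(package, ()))
        if include_suggests:
            stack.extend(suggests.get(package, ()))
    return needed


# Hypothetical package graph for illustration.
requires = {"a": ["b"], "b": ["c"]}
suggests = {"a": ["d"]}
print(sorted(dependency_closure(["a"], requires)))
print(sorted(dependency_closure(["a"], requires, suggests, include_suggests=True)))
```

An explicit stack with a visited set keeps the traversal safe on graphs where packages depend on each other in a cycle.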
The update command will be impacted by the sync schedule defined on the server. If the server only syncs every few weeks, update will only reference the latest data from CRAN available on the server.
CRAN Snapshot Sources¶
CRAN Snapshot sources allow administrators to create full CRAN sources that are pinned to a specific CRAN snapshot. Administrators can periodically update the snapshot to which the source is pinned. For example,
- If your organization has previously used MRAN snapshots, you can easily onboard to RSPM by replicating those snapshot dates.
- If your organization has historically installed packages all at once into a system library, for instance when new R versions are provisioned, you can use a CRAN snapshot to easily achieve the same effect.
- If your organization wants to "lag" behind CRAN, you can use a CRAN snapshot source and regularly update the source to a CRAN snapshot that trails the current CRAN release.
Git Sources¶
Git sources allow RStudio Package Manager to automatically expose R packages tracked in Git. Git sources work with internal packages as well as external sites such as GitHub.
Git sources require a configured R installation.
Git Builders¶
RStudio Package Manager defines a git-builder as an entity that watches a remote Git endpoint (e.g., git@github.com:user/example.git) for changes and builds R package bundles.
An admin follows these steps:
- Create a git source.
- Create a git-builder for the source, specifying whether to watch for commits to a Git branch or tags in a Git repository. The endpoint can be HTTP or SSH (see below). See the create git-builder command for full details, e.g., how to track a specific branch.
- Based on the selection specified with the create git-builder command, RStudio Package Manager clones the Git endpoint and runs an R job to transform the Git clone into a package bundle. The package bundle is made available to any repositories subscribing to the source.
- RStudio Package Manager polls the Git endpoint to watch for either new commits or new tags (based on the selection specified with the create git-builder command). If an update is available, RStudio Package Manager automatically pulls the new changes and launches an R job. The R job creates a package bundle from the updated Git clone and updates the package available in the git source. Previous versions are archived.
- Users install the package from the repository via install.packages, not devtools.
See the quickstart guide for a specific example.
Server log messages related to this component can be shown by enabling the git region.
More information about activating log regions is in the configuration appendix.
Access restricted Git endpoints using SSH keys¶
If Git builders require authentication, RStudio Package Manager can use SSH keys to authenticate against the endpoint.
Begin by creating an SSH key and granting the SSH key access to the Git endpoint. The specific steps will depend on your Git provider. Once you have the path to the SSH key, use the import command to securely name and store the SSH key for later use by RStudio Package Manager. If desired, you can now remove the SSH key file.
Multiple keys can be imported.
To use the newly imported SSH key with a new Git builder, specify the key name with the --ssh-key flag in the create git-builder command.
SSH Key Security¶
RStudio Package Manager encrypts and stores imported SSH keys in the metadata database. Any person (by default, members of the rstudio-pm unix group) with access to the admin CLI can:
- Associate an imported key with a Git builder using the create git-builder command
- List the names of available SSH keys using the list ssh-keys command
Users cannot access the contents of the key, nor is the key available for arbitrary actions. We recommend granting SSH keys imported to RStudio Package Manager limited read-only access to only the endpoints you wish to expose as R packages.
Imported keys are encrypted at rest. During Git operations that require SSH, the keys are added to an ssh-agent and are thus never written to the filesystem or to STDIN.
Although RStudio Package Manager allows the use of SSH keys with no passphrase, it is still recommended to use a strong SSH key with a passphrase.
Commits vs Tags¶
A package based on a Git endpoint can be configured to watch one of two types of changes: "commits" or "tags". In short, "commits" watches for changes to a specified Git branch, whereas "tags" watches for new tags in the whole Git repository. In more detail:
- Commits - RStudio Package Manager will update the package any time new commits are discovered in a branch. In this mode, RStudio Package Manager automatically modifies the package's version, assigning a unique version number to each build. The version number is created based on the commit time-stamp and is designed to avoid conflicts with the version scheme used by the package author. For example, if the Description file for a package indicates a version of 1.1-3, the automatic version number would be: 1.1-3.0.0.0.1537204599. If the author updates the package with a new commit, but keeps the version in the Description file the same, the new automatic version number would reflect the new commit time-stamp, e.g. 1.1-3.0.0.0.1537218677. This process ensures that users of the package always get the correct behavior from install.packages, with newer commits being associated with a semantically higher version number.
- Tags - RStudio Package Manager will update the package any time a new Git tag is discovered. In this mode, RStudio Package Manager retains the version specified in the package's Description file. This mode is designed to work when a Git tag is used to indicate a package release. Note: The name of the tag must match the version in the Description file. For example, if your package's Description file has Version: 5.4.2, your tag must be either 5.4.2 or v5.4.2. If two tags reference the same version, preference is given to the newer tag. If a newer tag references an older version than a prior tag, the new tag is built as an archived package. If a tag is removed from a Git endpoint, the package is deleted.
Commit mode is recommended for bleeding edge repositories, whereas tag mode is suitable for exposing stable releases of packages.
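Two rules from the list above lend themselves to a quick check in Python: automatic commit-mode versions sort above the author's base version and in commit order, and a tag must equal the version with an optional v prefix. The helpers below are hypothetical. The comparison approximates R's component-wise version ordering by treating each version as a tuple of integers; it does not reproduce how RStudio Package Manager generates version numbers.

```python
import re


def version_key(version):
    """Split an R-style version such as '1.1-3' into a tuple of integers.

    Tuples compare component by component, and a longer tuple with an equal
    prefix sorts higher, which approximates R's version ordering.
    """
    return tuple(int(part) for part in re.split(r"[.-]", version))


def tag_matches_version(tag, version):
    """A tag matches when it equals the version, optionally prefixed with 'v'."""
    return tag in (version, "v" + version)


# Version strings are the examples given in this section.
base = version_key("1.1-3")
first_build = version_key("1.1-3.0.0.0.1537204599")
second_build = version_key("1.1-3.0.0.0.1537218677")
print(base < first_build < second_build)      # True
print(tag_matches_version("v5.4.2", "5.4.2"))  # True
```

Because the commit time-stamp is the final component, a later commit compares higher even when the author leaves the Description version unchanged.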
A git source can support different packages with different modes. However, a given package can only have one mode in a source. If you would like to surface the same package in both commit and tag mode, you must create two git sources.
Git directories¶
By default, packages will be built from the git root directory. If the R package exists in a different location, it can be specified using the --sub-dir flag when adding a git package.
Managing Packages from Git¶
RStudio Package Manager automatically handles updating and archiving packages in git sources as the Git endpoints change. Additionally, the package artifacts themselves can be manually removed using the remove command.
Deleting a git source with the delete command removes all the packages generated by the Git builder and removes all the metadata about the Git endpoints.
Finally, it is possible to keep the package artifacts already created but stop RStudio Package Manager from tracking the Git endpoint. To do so, use:
Terminal
rspm delete git-builder --name=[name of package] --source=[name of source]
To view information about the current Git endpoints that are being tracked, use:
Terminal
rspm list git-builders
Combining packages from Git(Hub) with other package sources¶
Local packages cannot be added manually to a git source, but a repository can surface packages from a git source alongside local packages and CRAN packages by subscribing to multiple sources. Take care when managing a repository's subscriptions, because the order of sources determines which package is served when names conflict; see the Multiple Sources section.
Polling Frequency¶
You can control how frequently RStudio Package Manager checks for updates using the Git.PollInterval configuration field. If multiple commits occur between checks, RStudio Package Manager will create a single version representing all of the changes. If multiple tags are created or removed between checks, RStudio Package Manager will build each tag individually, automatically archiving tags representing older versions of the package.
Repository Versioning is identical in all source types, including git sources.
Tracking Changes and Errors¶
If a repository subscribes to a git source, you can view the git source's history in the Activity Log. The Activity Log will identify each change to a package including the new version, and a message will indicate the associated Git tag or commit as appropriate. If an error is encountered attempting to clone, poll, or bundle a package, the Activity Log will record the attempt and include a message with the CLI command to be run to view a full error log.
You can also use the following RSPM CLI commands to quickly check your active Git builders and view the logs:
Terminal
$ rspm list git-builders
<< Git Builders:
<< - [git package name]
<< Source: [source name]
<< URL: [source url]
<< Trigger: [git package trigger]
<< Key: none
Terminal
$ rspm list git-builds --source=[source name] --name=[git package name]
<< Git Builds:
<< - [git package name]
<< Transaction ID: [transaction ID]
<< SHA: [SHA]
<< Tag: [tag]
<< Status: [job status]
<< Time: [time of run]
<< Only showing latest build, for more builds use the --count and --page flags
<< For more information run: rspm logs --transaction-id=[transaction ID]
Terminal
$ rspm logs --transaction-id=[transaction ID]
<< ...
<< [git package run logs]
<< ...
RStudio Package Manager automatically tries to build updates from a Git source 3 times. If the build fails more than 3 times, the update causing the failure is ignored. New updates are still discovered and built.
To retry a failed update, or to force a Git builder to rebuild the latest package
version, use the rerun
command:
Terminal
rspm rerun git-builder \
  --name=[package name] \
  --source=[source name] \
  --tag=[tag to rebuild, only required if the build trigger is tags]
To aid in debugging, it can help to view output from the git commands that are run as well as output from the SSH connection when applicable. To enable debugging, refer to the Debug.Log configuration property in the configuration appendix.
To enable the debug log temporarily without restarting the server, use the rspm config command:
Terminal
rspm config debug logger activate git