| Title: | Access data from biological sequence databases like NCBI, ENA, MGnify |
|---|---|
| Description: | This package interacts with online biological sequence databases. It provides functions to search for sequences, convert identifiers and download sequences and associated metadata. |
| Authors: | Tamas Stirling |
| Maintainer: | Tamas Stirling <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1 |
| Built: | 2026-05-22 18:36:09 UTC |
| Source: | https://github.com/stitam/webseq |
Connect to a Dataset on a Dataverse server
dv_connect(dataset, version = ":latest", server, apikey = "")
dataset |
character, "persistentId" of the Dataset. |
version |
character, version of the Dataset. |
server |
character, URL of the Dataverse server. |
apikey |
character, your API key for the Dataverse server. |
... |
Additional arguments to pass to 'dataverse::get_dataset()'. |
The persistentId and the server address can be obtained from the URL of the Dataset. The URL starts with the server address, and the persistentId is the value of the "persistentId" query parameter. For example, in the URL "https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/NCKZD7", the server address is "https://dataverse.no" and the persistentId is "doi:10.18710/NCKZD7".
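The URL decomposition described above can be sketched in base R. The helper name `parse_dataset_url()` is hypothetical and not part of webseq; it only shows how the two values expected by `dv_connect()` could be pulled out of a Dataset URL.

```r
# Hypothetical helper (not part of webseq): split a Dataset URL into the
# two values that dv_connect() expects.
parse_dataset_url <- function(url) {
  # Server address: scheme and host, i.e. everything before the first path slash.
  server <- sub("^(https?://[^/]+).*$", "\\1", url)
  # persistentId: value of the "persistentId" query parameter.
  m <- regmatches(url, regexec("[?&]persistentId=([^&#]+)", url))[[1]]
  persistent_id <- if (length(m) == 2) utils::URLdecode(m[2]) else NA_character_
  list(server = server, dataset = persistent_id)
}

parts <- parse_dataset_url(
  "https://dataverse.no/dataset.xhtml?persistentId=doi:10.18710/NCKZD7"
)
parts$server   # "https://dataverse.no"
parts$dataset  # "doi:10.18710/NCKZD7"
```

The two elements could then be passed on as `dv_connect(dataset = parts$dataset, server = parts$server)`.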
If the Dataset is private, you will need to provide your API key to access it. You can obtain an API key by creating an account on the Dataverse server and generating a new API token in your account settings.
A connection object with Dataset metadata.
The function returns an attribute called "conn" which contains the connection information you provided when you called the function. If you are accessing a private Dataset, this attribute will contain your API key!
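One way to avoid leaking the key when saving or sharing a connection object is to drop the whole "conn" attribute from a copy first. The sketch below uses a mock object; the fields inside the attribute are assumptions made for illustration, and whether a stripped copy still works with the other dv_ functions is not stated in this documentation.

```r
# Mock object standing in for the value returned by dv_connect().
# The fields inside the "conn" attribute are assumptions for illustration.
conn <- structure(
  list(title = "Example Dataset"),
  conn = list(server = "https://dataverse.no", apikey = "your_api_key_here")
)

# Copy the object and remove the whole "conn" attribute before sharing it.
shareable <- conn
attr(shareable, "conn") <- NULL

is.null(attr(shareable, "conn"))  # TRUE
```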
[dv_list_files()], [dv_download()]
# connect to a public dataset on dataverse.no
conn <- dv_connect(
  dataset = "doi:10.18710/NCKZD7",
  server = "https://dataverse.no"
)

# connect to a private dataset on dataverse.no
conn <- dv_connect(
  dataset = "private_persistentId_here",
  server = "https://dataverse.no",
  apikey = "your_api_key_here"
)
Download files from a Dataset on a Dataverse server
dv_download(files, conn, dirpath = NULL, verbose = getOption("verbose"))
files |
a tibble of files from a 'dv_connection' object, typically obtained from [dv_list_files()] plus optional filtering. Must contain columns called 'filename', 'id' and 'md5'. |
conn |
a 'dv_connection' object from [dv_connect()]. |
dirpath |
character, path to the directory where files will be saved. Defaults to the current working directory. |
verbose |
logical, should verbose messages be printed to the console? |
A character vector of file paths for successfully downloaded files, or 'NA' for files that failed to download. Returned invisibly.
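Because the value is returned invisibly, it has to be assigned to be inspected. A minimal sketch of checking for failed downloads follows; the vectors are made-up stand-ins, and it assumes the returned paths are in the same order as the rows of `files`, which this documentation does not confirm.

```r
# Stand-in for the value returned by dv_download(); the paths are made up.
paths <- c("data/genome_a.fasta", NA, "data/genome_c.fasta")
# File names as they would appear in the 'filename' column of the input tibble.
requested <- c("genome_a.fasta", "genome_b.fasta", "genome_c.fasta")

# NA marks a failed download, so is.na() separates failures from successes.
failed <- requested[is.na(paths)]
downloaded <- paths[!is.na(paths)]

failed  # "genome_b.fasta"
```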
[dv_connect()], [dv_list_files()]
conn <- dv_connect(
  dataset = "doi:10.18710/NCKZD7",
  server = "https://dataverse.no"
)
conn |>
  dv_list_files() |>
  dplyr::filter(grepl("\\.fasta$", filename)) |>
  dv_download(conn = conn)
List files in a Dataset on a Dataverse server
dv_list_files(conn)
conn |
a 'dv_connection' object from [dv_connect()]. |
A tibble with information about the files in the Dataset.
[dv_connect()], [dv_download()]
conn <- dv_connect(
  dataset = "doi:10.18710/NCKZD7",
  server = "https://dataverse.no"
)
conn |>
  dv_list_files()
Download sequencing reads from the European Nucleotide Archive (ENA)
ena_download_reads(
  accession,
  type = "fastq",
  dirpath = NULL,
  mirror = TRUE,
  verbose = getOption("verbose")
)
accession |
character; a character vector of ENA Accession IDs. |
type |
character; A character string specifying the type of file to
download, either |
dirpath |
character; the path to the directory where the file should be
downloaded. If |
mirror |
logical; should the download directory mirror the structure of the FTP directory? |
verbose |
logical; should verbose messages be printed to console? |
If type == "run" the function will download the read files that were
originally submitted to ENA. If a run was originally submitted to another
INSDC database (NCBI SRA, DDBJ) then a file of this category will not be
available. If type == "fastq" the function will download one or more
FASTQ files for each run. If type == "err" the function will download
files that can be used with NCBI's SRA Toolkit.
https://ena-docs.readthedocs.io/en/latest/retrieval/file-download/sra-ftp-structure.html
## Not run:
ena_download_reads("ERR1649607", type = "fastq", verbose = TRUE)
## End(Not run)
Retrieve sequences from ENA
ena_query(
  accessions,
  mode = "fasta",
  expanded = FALSE,
  annotation_only = FALSE,
  line_limit = 0,
  download = FALSE,
  destfile_by = "all",
  gzip = FALSE,
  set = FALSE,
  include_links = FALSE,
  range = NULL,
  complement = FALSE,
  batch_size = 0,
  verbose = getOption("verbose")
)
accessions |
character; Accessions to query. |
mode |
character; Can be either |
expanded |
logical; Get expanded records for CON sequences. |
annotation_only |
logical; Only retrieve annotation, no sequence. |
line_limit |
integer; Limit the number of text lines returned. |
download |
logical; Download the result as a file. |
destfile_by |
character; Number of files to download.
|
gzip |
logical; Download the result as a gzip file. |
set |
logical; ??? |
include_links |
logical; ??? |
range |
character; ??? |
complement |
logical; ??? |
batch_size |
integer; Number of accessions to query in a single request. Using this value, accessions will be broken down into one or more batches. If set to 0, all accessions will be queried in a single request. |
verbose |
logical; Should verbose messages be printed to the console? |
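The batching behaviour described for `batch_size` can be sketched in base R. The helper `split_into_batches()` is hypothetical and only illustrates the documented rule (0 means a single request); it is not the package's internal code.

```r
# Hypothetical helper (not part of webseq): split accessions into batches.
split_into_batches <- function(accessions, batch_size = 0) {
  # A batch size of 0 means "everything in a single request".
  if (batch_size == 0 || length(accessions) <= batch_size) {
    return(list(accessions))
  }
  # Assign consecutive accessions to groups of at most batch_size elements.
  groups <- ceiling(seq_along(accessions) / batch_size)
  unname(split(accessions, groups))
}

accessions <- c("LC136852", "LC136853")
length(split_into_batches(accessions, batch_size = 0))  # 1
length(split_into_batches(accessions, batch_size = 1))  # 2
```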
## Not run:
ena_query("LC136852")
ena_query(c("LC136852", "LC136853"))
## End(Not run)
Take a vector of ENA accessions and convert them to NCBI accessions.
ena2ncbi(accessions, type)
accessions |
character; a vector of ENA accessions. |
type |
character; type of ENA accessions. Supported types: 'sample', 'study'. |
A tibble with two columns, 'ena' and 'ncbi'.
ena2ncbi("ERS3202441", type = "sample")
ena2ncbi(c("ERS3202441", "ERS3202442"), type = "sample")
ena2ncbi("ERP161024", type = "study")
This data set contains a list of IDs which can be used to access data from various data sources. These IDs are used across the package in function documentation, tests and vignettes.
examples
A list with 6 elements:
NCBI Assembly IDs
NCBI BioProject IDs
NCBI BioSample IDs
NCBI Gene IDs
NCBI Protein IDs
NCBI SRA IDs
All assembly reports contain GenBank and/or RefSeq identifiers that uniquely identify a contig. This function can be used to extract both GenBank and RefSeq accessions from a parsed assembly report.
extract_accn(report)
report |
list; a parsed assembly report. use |
a data frame
This is the fifth step within the pipeline for downloading GenBank files.
get_genomeid,
get_report_url(),
download_report(),
parse_report(),
download_gb()
## Not run:
phages <- get_genomeid("Autographiviridae", db = "assembly")
report_url <- get_report_url(phages$ids[1])
download_report(report_url)
filename <- dir(paste0(tempdir(), "/assembly_reports"))
filepath <- paste0(tempdir(), "/assembly_reports/", filename)
rpt <- parse_report(filepath)
extract_accn(rpt)
## End(Not run)
Some functions may download files that only differ in their source (e.g. GCA from GenBank assemblies or GCF for RefSeq assemblies) or their version number (v1, v2, etc.). This function helps remove redundant files by flagging which files should be kept for further analysis.
flag_files(filenames)
filenames |
character; a character vector of filenames. Currently the function only supports GCA/GCF identifiers. Look at the examples for more details. |
The function first prioritises GCF over GCA and then the highest version number.
The function returns a data frame where each file is listed in the first column and the recommendation to keep the file for further analysis is listed in the last column.
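The prioritisation rule (GCF before GCA, then the highest version) can be sketched in base R as follows. The function `flag_sketch()` and its column names are hypothetical; this is an illustration of the documented rule, not the implementation of `flag_files()`.

```r
# Hypothetical helper, not part of webseq: flags which file to keep per assembly.
flag_sketch <- function(filenames) {
  # Capture the prefix (GCA/GCF), the assembly number and the version.
  parts <- regmatches(filenames, regexec("^(GC[AF])_([0-9]+)\\.([0-9]+)", filenames))
  df <- data.frame(
    filename = filenames,
    source = vapply(parts, `[`, character(1), 2),
    assembly = vapply(parts, `[`, character(1), 3),
    version = as.integer(vapply(parts, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
  # Rank within each assembly: GCF first, then highest version.
  ord <- order(df$assembly, df$source != "GCF", -df$version)
  # The first row of each assembly in that ordering is the one to keep.
  best <- ord[!duplicated(df$assembly[ord])]
  df$keep <- seq_len(nrow(df)) %in% best
  df
}

flagged <- flag_sketch(c(
  "GCA_003012895.2_ASM301289v2_genomic.fna",
  "GCF_003012895.1_ASM301289v1_genomic.fna"
))
flagged$keep  # FALSE TRUE
```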
# keep GCF
filenames <- c("GCA_003012895.2_ASM301289v2_genomic.fna",
               "GCF_003012895.2_ASM301289v2_genomic.fna")
flag_files(filenames)

# keep GCF even when version number is lower
filenames <- c("GCA_003012895.2_ASM301289v2_genomic.fna",
               "GCF_003012895.1_ASM301289v1_genomic.fna")
flag_files(filenames)

filenames <- c("GCA_003012895.1_ASM301289v1_genomic.fna",
               "GCA_003012895.2_ASM301289v2_genomic.fna")
flag_files(filenames)
This is a utility function that converts a list of lists, each containing a named list of data frames, into a single flat list of data frames.
flatten(x)
x |
list; a list which contains a named list of data frames |
a flat list of data frames where lists of data frames with the same name are merged.
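A minimal sketch of this kind of flattening is shown below. It assumes that "merged" means row-binding data frames that share a name and have identical columns; `flatten_sketch()` is a hypothetical name and may differ from how `flatten()` actually combines its inputs.

```r
# Hypothetical helper (not part of webseq): merge same-named data frames.
flatten_sketch <- function(x) {
  # Collect every data frame name that occurs in any of the inner lists.
  nms <- unique(unlist(lapply(x, names)))
  out <- lapply(nms, function(nm) {
    # Pull the data frame with this name from each inner list, if present.
    pieces <- lapply(x, function(el) el[[nm]])
    pieces <- pieces[!vapply(pieces, is.null, logical(1))]
    # Assumption: same-named data frames share columns and can be row-bound.
    do.call(rbind, pieces)
  })
  stats::setNames(out, nms)
}

x <- list(
  list(a = data.frame(id = 1), b = data.frame(id = 10)),
  list(a = data.frame(id = 2))
)
flat <- flatten_sketch(x)
names(flat)   # "a" "b"
nrow(flat$a)  # 2
```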
This function queries MGnify for all available endpoints.
mgnify_endpoints(verbose = getOption("verbose"))
verbose |
logical; should verbose messages be printed to console? |
a tibble of APIs and their respective endpoints
The function prints contents of the following url: https://www.ebi.ac.uk/metagenomics/api/v1/
## Not run:
mgnify_endpoints(verbose = TRUE)
## End(Not run)
This function can be used for searching MGnify using an identifier.
mgnify_instance(query, from)
query |
character; the identifier |
from |
character; the api which contains this identifier. See
|
a list
## Not run:
# look up an assembly
mgnify_instance("ERZ477576", from = "assemblies")
## End(Not run)
This function retrieves a list of identifiers to look up with other functions.
mgnify_list(
  query,
  from,
  from_id,
  page = NULL,
  sleep = 0.2,
  verbose = getOption("verbose")
)
query |
character; what to look for. |
from |
character; API. See |
from_id |
character; more precise filtering for the API. |
page |
numeric; the API's response is paginated, and this tells the API which
page to return. If |
sleep |
numeric; number of seconds to sleep before requesting the next page. |
verbose |
logical; should verbose messages be printed to console? |
## Not run:
# Query samples collected from biogas plants
mgnify_list(query = "samples",
            from = "biomes",
            from_id = "root:Engineered:Biogas plant",
            page = 1)
## End(Not run)
This function directly downloads genome data through the NCBI FTP server.
ncbi_download_genome(
  query,
  type = "genomic.fna",
  dirpath = NULL,
  mirror = FALSE,
  verbose = getOption("verbose")
)
query |
an object of class 'ncbi_uid', 'ncbi_uid_link', 'ncbi_link', or an integer vector of NCBI Assembly UIDs. See Details for more information. |
type |
character; the file extension to download. Valid options are
|
dirpath |
character; the path to the directory where the file should be
downloaded. If |
mirror |
logical; should the download directory mirror the structure of the FTP directory? |
verbose |
logical; should verbose messages be printed to console? |
'ncbi_get_uid()' returns an object of class 'ncbi_uid'; 'ncbi_link_uid()' returns an object of class 'ncbi_uid_link'; 'ncbi_link()' returns an object of class 'ncbi_link'. These objects may be used directly as query input for 'ncbi_download_genome()', and this is the recommended approach. Alternatively, you can use a character vector of UIDs as query input. This approach is not recommended because there are no consistency checks: the function will simply attempt to interpret the query as NCBI Assembly UIDs.
## Not run:
# Download a single genome
ncbi_get_uid("GCF_003007635.1", db = "assembly") |>
  ncbi_download_genome()

"SAMN08619567" |>
  ncbi_get_uid(db = "biosample") |>
  ncbi_link_uid(to = "assembly") |>
  ncbi_download_genome()

"SAMN08619567" |>
  ncbi_link(from = "biosample", to = "assembly") |>
  ncbi_download_genome()

# Download multiple genomes, mirror FTP directory structure
data(examples)
examples$assembly |>
  ncbi_get_uid(db = "assembly") |>
  ncbi_download_genome()
## End(Not run)
This function retrieves sequence metadata from a given NCBI sequence database.
ncbi_get_meta(
  query,
  db = NULL,
  batch_size = 100,
  use_history = TRUE,
  parse = TRUE,
  mc_cores = NULL,
  verbose = getOption("verbose")
)
query |
either an object of class |
db |
character; the database to search in. For options see
|
batch_size |
integer; the number of search terms to query at once. If
the number of search terms is larger than |
use_history |
logical; should the function use web history for faster API queries? |
parse |
logical; should the function attempt to parse the output into a tibble? |
mc_cores |
integer; number of cores to use for parallel processing. Only
used if |
verbose |
logical; Should verbose messages be printed to console? |
You can give UIDs to ncbi_get_meta() in two ways:
1. You can use functions like ncbi_get_uid() or ncbi_link_uid() to get UIDs, and then use the returned ncbi_uid objects directly with ncbi_get_meta(). If you follow this approach then you do not have to specify the db argument since the function can extract it from the ncbi_uid object. However, if you do provide it, then it must be identical to the db attribute of the "ncbi_uid" object.
2. Alternatively, you can just provide a vector of UIDs, but then you must specify the db argument as well.
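The rule for resolving the database can be illustrated with a small base R sketch. The helper `resolve_db()` is hypothetical, and the mock object assumes that the database is stored as a list element named `db`, as described for the value returned by ncbi_get_uid(); the real object may store it differently.

```r
# Hypothetical helper (not part of webseq) that mirrors the documented rule.
resolve_db <- function(query, db = NULL) {
  if (inherits(query, "ncbi_uid")) {
    # db is optional here, but if given it must agree with the object.
    if (!is.null(db) && !identical(db, query$db)) {
      stop("'db' does not match the database stored in the query object")
    }
    return(query$db)
  }
  # Plain UID vectors carry no database information.
  if (is.null(db)) stop("'db' is required when query is a plain vector of UIDs")
  db
}

# Mock ncbi_uid object with the three documented elements.
uids <- structure(
  list(uid = 4505768, db = "biosample", web_history = NULL),
  class = "ncbi_uid"
)

resolve_db(uids)                        # "biosample"
resolve_db(4505768, db = "biosample")   # "biosample"
```

Calling `resolve_db(uids, db = "assembly")` or `resolve_db(4505768)` raises an error in this sketch, matching the two constraints stated above.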
If parse = FALSE the function will return an object of class
ncbi_meta, which is a character vector with some extra information
about the database. This output can be used directly with ncbi_parse.
If parse = TRUE the function will attempt to parse the data using
ncbi_parse. If parsing is successful, the function will return a
tibble, otherwise it will return the unparsed ncbi_meta object.
## Not run:
data(examples)
uids <- ncbi_get_uid(examples$biosample, db = "biosample")
meta <- ncbi_get_meta(uids)
## End(Not run)
This function retrieves object summary data from a given NCBI database.
ncbi_get_summary(
  query,
  db = NULL,
  batch_size = 100,
  verbose = getOption("verbose")
)
query |
either an object of class |
db |
character; the database to search in. For options see
|
batch_size |
integer; the number of items to query at once. If query length is larger than 'batch_size', the query will be split into batches. |
verbose |
logical; should verbose messages be printed to the console? |
If query is an 'ncbi_uid' object, the 'db' argument is optional. If 'db' is not specified, the function will retrieve it from the query object. However, if it is specified, it must be identical to the 'db' attribute of the query.
A list of rentrez summary objects.
## Not run:
assemblies <- c("GCF_000002435.2", "GCF_000299415.1")
uids <- ncbi_get_uid(assemblies, db = "assembly")
ncbi_get_summary(uids)
## End(Not run)
This function replicates the NCBI website's search utility. It searches for one or more search terms in the chosen database and returns internal NCBI UIDs for the hits. These can be used, for example, to link NCBI entries with entries in other NCBI databases or to retrieve the data itself.
ncbi_get_uid(
  term,
  db,
  batch_size = 100,
  use_history = TRUE,
  na_strings = "NA",
  verbose = getOption("verbose")
)
term |
character; one or more search terms. |
db |
character; the database to search in. For options see
|
batch_size |
integer; the number of search terms to query at once. If
the number of search terms is larger than |
use_history |
logical; should the function use web history for faster API queries? |
na_strings |
character; a vector of strings which should be interpreted as 'NA'. |
verbose |
logical; should verbose messages be printed to the console? |
The default value for batch_size should work in most cases.
However, if the search terms are very long, the function may fail with an
error message. In this case, try reducing the batch_size value.
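Reducing the batch size after a failure can also be automated. The wrapper below is a hypothetical sketch in base R: `retry_with_smaller_batches()` is not a webseq function, and the mock query only simulates a request that fails for large batches.

```r
# Hypothetical wrapper: retry a call with progressively smaller batch sizes.
retry_with_smaller_batches <- function(fun, batch_sizes = c(100, 50, 10)) {
  for (size in batch_sizes) {
    # Errors are caught and turned into NULL so the next size can be tried.
    result <- tryCatch(fun(size), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("all batch sizes failed")
}

# Mock query that fails whenever the batch is larger than 20 terms.
mock_query <- function(size) {
  if (size > 20) stop("request too long")
  size
}

retry_with_smaller_batches(mock_query)  # 10
```

With webseq the function argument could be something like `function(size) ncbi_get_uid(terms, db = "assembly", batch_size = size)`, where `terms` is your own vector of search terms.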
An object of class "ncbi_uid" which is a list with three
elements:
uid: a vector of UIDs.
db: the database used for the query.
web_history: a tibble of web histories.
ncbi_get_uid("GCA_003012895.2", db = "assembly")
ncbi_get_uid("Autographiviridae OR Podoviridae", db = "biosample")
ncbi_get_uid(c("WP_093980916.1", "WP_181249115.1"), db = "protein")
Each entry in an NCBI database has its unique ID. Entries in different databases may be linked. For example, entries in the NCBI Assembly database may be linked with entries in the NCBI BioSample database. This function attempts to link IDs from one database to another.
ncbi_link(
  query,
  from,
  to,
  multiple = "all",
  batch_size = 100,
  verbose = getOption("verbose")
)
query |
character; a vector of IDs |
from |
character; the database the queried IDs come from.
|
to |
character; the database in which the function should look for links.
|
multiple |
character; handling of rows in x with multiple matches in y. For more information see '?dplyr::left_join()'. |
batch_size |
integer; the number of search terms to query at once. If
the number of search terms is larger than |
verbose |
logical; should verbose messages be printed to the console? |
A tibble with two columns. The first column contains IDs in the 'from' database, the second column contains linked IDs in the 'to' database.
## Not run:
ncbi_link("GCF_000002435.2", from = "assembly", to = "biosample")
ncbi_link("SAMN02714232", from = "biosample", to = "assembly")
## End(Not run)
Each entry in an NCBI database has its unique internal id. Entries in different databases may be linked. For example, entries in the NCBI Assembly database may be linked with entries in the NCBI BioSample database. This function attempts to link uids from one database to another.
ncbi_link_uid(
  query,
  from = NULL,
  to,
  batch_size = 100,
  verbose = getOption("verbose")
)
query |
either an object of class 'ncbi_uid' or 'ncbi_uid_link', or an integer vector of UIDs. See Details for more information. |
from |
character; the database the queried UIDs come from.
|
to |
character; the database in which the function should look for links.
|
batch_size |
integer; the number of search terms to query at once. If
the number of search terms is larger than |
verbose |
logical; should verbose messages be printed to the console? |
The function can take three query classes: It can take 'ncbi_uid' objects, these are returned by 'ncbi_get_uid()'. In this case, the 'from' argument will be retrieved from the query object, by default. It can also take 'ncbi_uid_link' objects, which means 'ncbi_link_uid()' can be called several times in a sequence to perform a number of successive conversions. When the query is an 'ncbi_uid_link' object, the function will always convert the UIDs in the last column of the query object, and will retrieve the 'from' argument from the name of the last column. This means links should always be interpreted "left-to-right". Note, when tibbles are joined during subsequent 'ncbi_link_uid' calls they are joined using "many-to-many" relationships; see '?dplyr::left_join()' for more information. Lastly, the function can also take a vector of integer UIDs.
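The effect of joining successive link results "left-to-right" with many-to-many relationships can be illustrated with made-up UIDs. This sketch uses base R `merge()` instead of `dplyr::left_join()` to stay dependency-free; the numbers are placeholders, not real NCBI UIDs.

```r
# Made-up UIDs standing in for two successive link results.
assembly_to_biosample <- data.frame(
  assembly = c(1, 1),
  biosample = c(11, 12)
)
biosample_to_bioproject <- data.frame(
  biosample = c(11, 12, 12),
  bioproject = c(101, 101, 102)
)

# Join on the last column of the left table; every match is kept, so one
# assembly can end up on several rows.
linked <- merge(
  assembly_to_biosample,
  biosample_to_bioproject,
  by = "biosample",
  all.x = TRUE
)
# Restore the left-to-right column order of the conversion chain.
linked <- linked[, c("assembly", "biosample", "bioproject")]

nrow(linked)  # 3
```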
A tibble with two or more columns. When 'ncbi_link_uid()' is called on a 'ncbi_uid' object or a vector of UIDs, the function returns a tibble with exactly two columns: the first column contains UIDs in the 'from' database, and the second column contains linked UIDs in the 'to' database. However, 'ncbi_link_uid()' can be called multiple times in succession. Each call after the first call will add a new column to the returned tibble. See Details for more information.
# Simple call with integer UIDs
ncbi_link_uid(5197591, "assembly", "biosample")
ncbi_link_uid(c(1226742659, 1883410844), "protein", "nuccore")

# Complex call with ncbi_get_uid() and several ncbi_link_uid() calls
"GCF_000299415.1" |>
  ncbi_get_uid(db = "assembly") |>
  ncbi_link_uid(to = "biosample") |>
  ncbi_link_uid(to = "bioproject") |>
  ncbi_link_uid(to = "pubmed")
Retrieve NCBI Assembly metadata
ncbi_meta_assembly(assembly_uid)
assembly_uid |
numeric |
## Not run:
ncbi_meta_assembly(419738)
## End(Not run)
This function can be used to parse various retrieved non-sequence data sets from NCBI into a tibble. These data sets usually accompany the biological sequences and contain additional information e.g. identifiers, information about the sample, the sequencing platform, etc.
ncbi_parse(
  meta,
  db = NULL,
  format = "xml",
  mc_cores = NULL,
  verbose = getOption("verbose")
)
meta |
character; either an unparsed metadata object returned by
|
db |
character; the NCBI database from which the data was retrieved. |
format |
character; the format of the data set. Currently only
|
mc_cores |
integer; Number of cores to use for parallel processing. |
verbose |
logical; Should verbose messages be printed to console? |
This function is integrated into ncbi_get_meta() and is
called automatically if parse = TRUE (default). However, it can also
be used separately e.g. when you want to examine the unparsed metadata
object before parsing, or when you already downloaded the metadata manually
and you just want to parse it into a tabular format.
If meta is an unparsed ncbi_meta object returned by
ncbi_get_meta() then the db argument is optional. If db
is not specified, the function will extract it automatically. However, if it
is specified, it must be identical to the db attribute of the metadata
object. If meta is not an ncbi_meta object, the db
argument is required.
a tibble.
## Not run:
data(examples)

# NCBI Assembly, download XML file from NCBI and parse
# Manually download the XML file
# https://www.ncbi.nlm.nih.gov/assembly/GCF_000299415.1
# upper right corner -> send to -> file -> format = xml -> create file
# Parse XML
ncbi_parse(meta = "assembly_summary.xml", db = "assembly", format = "xml")

# NCBI BioSample, fully programmatic access, separate retrieval and parsing
# Get metadata but do not parse
meta <- ncbi_get_meta(examples$biosample, db = "biosample", parse = FALSE)
# Parse metadata separately, the function will extract 'db' automatically.
ncbi_parse(meta = meta)

# NCBI BioSample, download XML file from NCBI and parse
# Manually download the XML file
# https://www.ncbi.nlm.nih.gov/biosample/?term=SAMN02714232
# upper right corner -> send to -> file -> format = full (xml) -> create file
# Parse XML
ncbi_parse(meta = "biosample_result.xml", db = "biosample", format = "xml")
## End(Not run)
This function can be used to parse an xml file from the NCBI assembly database into a tibble.
ncbi_parse_assembly_xml(file, verbose = getOption("verbose"))
file |
character; path to an xml file. |
verbose |
logical; Should verbose messages be printed to console? |
a tibble.
## Not run:
# search for Acinetobacter baumannii within the NCBI Assembly database
# https://www.ncbi.nlm.nih.gov/assembly/?term=acinetobacter%20baumannii
# upper right corner -> send to -> file -> format = xml -> create file
# parse the downloaded file
ncbi_parse_assembly_xml("assembly_summary.xml")
## End(Not run)
This function parses a txt file from the NCBI BioSample database.
ncbi_parse_biosample_txt(
  file,
  resolve_na = TRUE,
  verbose = getOption("verbose")
)
file |
character; path to a txt file. |
resolve_na |
logical; replace strings that match NA terms with NA. |
verbose |
logical; should verbose output be printed to console? |
a tibble.
## Not run:
# search for Acinetobacter baumannii within the NCBI BioSample database
# https://www.ncbi.nlm.nih.gov/biosample/?term=acinetobacter+baumannii
# upper right corner -> send to -> file -> format = full (text) -> create file
# parse the downloaded file
ncbi_parse_biosample_txt("biosample_summary.txt")
## End(Not run)
BioSample metadata from NCBI can be retrieved in multiple file formats. This function parses metadata retrieved in XML format.
ncbi_parse_biosample_xml(
  biosample_xml,
  mc_cores = NULL,
  verbose = getOption("verbose")
)
biosample_xml |
character; unparsed XML metadata either returned by
|
mc_cores |
integer; number of cores to use for parallel processing. If
|
verbose |
logical; Should verbose messages be printed to console? |
This function posts a vector of NCBI UIDs to the NCBI webservice. The purpose of this function is to generate an ncbi_uid object with web history which can then be used in downstream functions.
ncbi_post_uid(uid, db, batch_size = 100, verbose = getOption("verbose"))
uid |
numeric; a vector of NCBI UIDs. |
db |
character; the database the UIDs belong to. For options see 'ncbi_dbs()'. |
batch_size |
integer; the number of search terms to query at once. If
the number of search terms is larger than |
verbose |
logical; should verbose messages be printed to the console? |
The default value for batch_size should work in most cases.
However, if the vector of UIDs is very long, the function may fail with an
error message. In this case, try reducing the batch_size value.
An object of class "ncbi_uid" which is a list with three
elements:
uid: a vector of UIDs.
db: the database the UIDs belong to.
web_history: a tibble of web histories.
In webseq NCBI UIDs are acquired using 'ncbi_get_uid()' or 'ncbi_link_uid()' and are used for acquiring linked UIDs in other NCBI databases or for retrieving the data itself. In some cases data retrieval is faster when using web history. However, you may have UIDs without a web history or your web history may expire. In this case you can use 'ncbi_post_uid()' to generate a new ncbi_uid object with web history.
ncbi_post_uid(4505768, db = "biosample")
Many functions that interact with NCBI return UIDs. This function converts the UIDs to NCBI IDs.
ncbi_recover_id(
  query,
  db = NULL,
  batch_size = 100,
  verbose = getOption("verbose")
)
query |
either an object of class |
db |
character; the database to search in. For options see
|
batch_size |
integer; the number of items to query at once. If query length is larger than 'batch_size', the query will be split into batches. |
verbose |
logical; should verbose messages be printed to the console? |
If query is an 'ncbi_uid' object, the 'db' argument is optional. If 'db' is not specified, the function will retrieve it from the query object. However, if it is specified, it must be identical to the 'db' attribute of the query.
A vector of matching IDs.
## Not run:
uid <- ncbi_get_uid("GCF_000002435.2", db = "assembly")
ncbi_recover_id(uid)
## End(Not run)
Take a vector of NCBI accessions and convert them to ENA accessions.
ncbi2ena(accessions, type)
accessions |
character; a vector of NCBI accessions. |
type |
character; type of NCBI accessions. Supported types: 'biosample', 'bioproject'. |
A tibble with two columns, 'ncbi' and 'ena'.
ncbi2ena("SAMEA111452506", type = "biosample")
This function can be used to parse a downloaded assembly report.
parse_report(file)
file |
character; the file path to the assembly report. |
The function returns an object of classes arpt and
list. The unique class is required for compatibility with subsequent
functions in the pipeline. Otherwise data from the returned object can be
extracted through general list operations.
This is the fourth step within the pipeline for downloading GenBank files.
get_genomeid,
get_report_url(),
download_report(),
extract_accn(),
download_gb()
## Not run:
phages <- get_genomeid("Autographiviridae", db = "assembly")
report_url <- get_report_url(phages$ids[1])
download_report(report_url)
filename <- dir(paste0(tempdir(), "/assembly_reports"))
filepath <- paste0(tempdir(), "/assembly_reports/", filename)
parse_report(filepath)
## End(Not run)