Data analysis is becoming increasingly important in biological research. To study gene expression in a collection of samples, or transcription factor binding patterns across genomic atlases, a researcher typically needs external data such as a reference genome or a gene annotation. There are three main providers of this type of data: Ensembl1, UCSC2, and NCBI3, alongside several model system-specific providers such as GENCODE4, ZFIN5, FlyBase6, WormBase7, Xenbase8, and others. Providers differ in how they build genome assemblies and gene annotations, which affects formats, format conformance, terminology, data quality, available versions, and release cycles.

These differences strongly affect compatibility with research9, tools, and (data derived from) other genomic data. Researchers can look for genomic data themselves, and there are several ways to do so. UCSC and NCBI each provide FTP archives, online portals, and REST APIs for searching their respective datasets. Alternatively, programs such as NCBI-genome-download10 and ucscgenomes-downloader11 can access some of these datasets programmatically. However, none of these options can search, compare, or download data from all major genomic providers. In addition, obtaining and processing genomic data by hand is time-consuming, error-prone, and difficult to reproduce.

Although the latter problem can be addressed with a data management tool such as iGenomes12, refGenie13, or Go Get Data14, most data managers still require the user to add new data manually. The researchers created genomepy to

  1. find genomic data across the main providers,
  2. compare gene annotations,
  3. choose the most suitable genomic data for a given investigation, and
  4. provide a set of utilities to browse and manipulate data.

Data can then be downloaded from any of these providers and is processed automatically.
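As a rough illustration of this workflow from Python, the sketch below wraps genomepy's `install_genome` function and `Genome` class. The wrapper name `install_with_annotation` is hypothetical, the example assumes that genomepy is installed and that network access is available, and the exact keyword arguments may differ between genomepy versions.

```python
def install_with_annotation(name, provider=None, genomes_dir=None):
    """Download a genome and its gene annotation, then open the genome.

    Hypothetical convenience wrapper, not part of genomepy itself.
    Needs genomepy installed and network access when it is called.
    """
    # Imported lazily so that merely defining this helper does not
    # require genomepy to be present.
    import genomepy

    # Download and preprocess the assembly; annotation=True also asks
    # for the matching gene annotation.
    genomepy.install_genome(
        name,
        provider=provider,
        genomes_dir=genomes_dir,
        annotation=True,
    )

    # Open the installed genome for sequence access.
    return genomepy.Genome(name, genomes_dir=genomes_dir)
```

Under these assumptions, a call such as `install_with_annotation("hg38", provider="UCSC")` would fetch the assembly together with its annotation and return an object for working with the sequences.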

Data sources and processing steps are recorded for reproducibility, which can be improved further by using a data manager. Genomic data can be loaded into genomepy, which builds on and extends packages such as pyfaidx15, pandas16, and MyGene.info17 so that users can work quickly with gene and genome sequences and metadata. Genomepy can be used from the command line and through its (well-documented) Python API, for one-off analyses or for integration into pipelines and workflow managers. It has also been integrated into other programs, such as pybedtools18 and CellOracle19.
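To make the idea of recording sources and processing steps concrete, here is a minimal sketch that uses only the Python standard library. It is not genomepy's own bookkeeping: the function name `record_provenance`, the JSON layout, and the `.provenance.json` suffix are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_provenance(data_file, source_url, provider, steps):
    """Write a JSON sidecar describing a file's origin and processing.

    Hypothetical helper: the field names are illustrative only.
    """
    data_file = Path(data_file)

    # A checksum lets a later run confirm it is using the same bytes.
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()

    record = {
        "file": data_file.name,
        "provider": provider,
        "source_url": source_url,
        "processing_steps": list(steps),
        "sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

    # Store the record next to the data so the two travel together.
    sidecar = data_file.with_name(data_file.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

Keeping such a record beside the data means a later analysis can check which provider, URL, and processing steps produced a given file.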

Obtaining suitable genomic data is a crucial step in any genomics project. Genomepy makes it easy to find and download available assemblies: a genome with the required sequence masking, biological diversity and contigs can be obtained with a single command. Gene annotations in GTF and BED12 format that match the genome can be retrieved in the same way, with additional options available through the Python API. All chosen installation options are recorded for reproducibility, so researchers can begin their study with confidence.
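One practical meaning of an annotation that corresponds to its genome is that both use the same contig names. The sketch below is an illustration and not genomepy's implementation. It compares the contig names in FASTA text with those in the first column of GTF text. The helper names and the toy data are hypothetical.

```python
def fasta_contigs(fasta_text):
    """Collect contig names from FASTA header lines (text after '>')."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip empty headers
                names.add(fields[0])
    return names


def gtf_contigs(gtf_text):
    """Collect contig names from the first tab-separated GTF column."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_contigs(fasta_text, gtf_text):
    """Return annotation contigs that are missing from the genome."""
    return sorted(gtf_contigs(gtf_text) - fasta_contigs(fasta_text))


# Toy data: the second annotation line uses a contig name ("3")
# that does not occur in the genome.
toy_fasta = ">chr1 toy contig\nACGT\n>chr2\nTTGA\n"
toy_gtf = (
    'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    '3\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)
print(unmatched_contigs(toy_fasta, toy_gtf))  # ['3']
```

An empty result suggests the annotation and genome share a naming scheme. A non-empty one points to contigs that would need renaming or filtering before the two are used together.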

You can install genomepy via bioconda:

conda install -c bioconda genomepy

Or via pip with Python 3.7+:

pip install genomepy

[Figure: a working example of the genomepy library]
This article is a research summary written by Marktechpost staff, based on the research paper 'genomepy: genes and genomes at your fingertips'. All credit for this research goes to the researchers on this project.
