WhatsGNU

License: GPL v3

What’s Gene Novelty Unit: A Tool For Identifying Proteomic Novelty.

Introduction

WhatsGNU uses the natural variation in public databases to rank protein sequences by the number of observed exact protein matches (the GNU score) in all known genomes of a given species, and can quickly generate a report for all the proteins in a query.
WhatsGNU compresses a protein database by exact match into a much smaller set of protein variants that differ from one another by at least one amino acid. It saves a copy of the compressed database in two formats, database.txt and database.pickle, for faster subsequent use.
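
The idea of exact-match compression can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the WhatsGNU implementation: the function names `compress_proteins` and `gnu_score` and the toy strain IDs are hypothetical, and the real database files carry more information than this dictionary.

```python
from collections import defaultdict


def compress_proteins(records):
    """Group protein IDs by exact amino-acid sequence.

    `records` is an iterable of (protein_id, sequence) pairs. The result maps
    each distinct sequence to the list of IDs that carry it.
    """
    compressed = defaultdict(list)
    for protein_id, sequence in records:
        compressed[sequence].append(protein_id)
    return dict(compressed)


def gnu_score(compressed, query_sequence):
    """Count database proteins that match the query exactly (0 if unseen)."""
    return len(compressed.get(query_sequence, []))


# Toy database: three strains carrying two distinct variants of one protein.
database = [
    ("strainA_00001", "MVM"),
    ("strainB_00001", "MVM"),
    ("strainC_00001", "MVV"),
]
compressed = compress_proteins(database)

print(len(compressed))               # 2 distinct variants
print(gnu_score(compressed, "MVM"))  # 2
print(gnu_score(compressed, "MMM"))  # 0
```

Keying on the full sequence means that a lookup is a single dictionary access, which is consistent with reports being produced quickly once the database is compressed.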

Installation

Dependencies

If you do not have Miniconda or Anaconda installed already, you can install one of them from:

  1. Miniconda
  2. Anaconda

Windows

Follow instructions for installing Windows Subsystem for Linux (WSL) on https://docs.microsoft.com/en-us/windows/wsl/install-win10
Briefly:

  1. Open PowerShell as Administrator and run:
    Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

  2. Install Linux distribution app from Microsoft Store (tested on Ubuntu 18.04 LTS).
  3. Set up username and password.
  4. Update the system and install dependencies:
    sudo apt update && sudo apt upgrade
    sudo apt install python3-pip
    pip3 install --user numpy scipy matplotlib ipython jupyter pandas sympy nose
    sudo apt install unzip
    sudo apt install ncbi-blast+
    git clone https://github.com/ahmedmagds/WhatsGNU.git
    export PATH=$PATH:/home/user_name/WhatsGNU/bin
    

    Note: Your Windows C:\Users\ gets mapped to /mnt/c/Users/ in WSL. You can copy between the two directories using a command like:

    cp /mnt/c/Users/Windows_username/Desktop/file.fasta /home/Ubuntu_user_name/
    

Test

Available Databases

There are three different types of databases available to use: basic, ortholog, or hashed basic databases. At this time, hashed ortholog databases are not available for use, but will be in the future. For more information on the uses and limitations of hashed databases, skip to the WhatsGNU_main_hashes.py section under WhatsGNU toolbox.

The following databases are available to download and use:

Ortholog Mode:

  1. Klebsiella pneumoniae Version: 04/17/2020 (compressed 46,072,343 proteins in 8752 genomes to 1,466,934 protein variants). Updated April 2023.
  2. Pseudomonas aeruginosa Version: 07/06/2019 (compressed 14,475,742 proteins in 4712 genomes to 1,288,892 protein variants)
  3. Mycobacterium tuberculosis Version: 07/09/2019 (compressed 26,794,006 proteins in 6563 genomes to 434,725 protein variants).
  4. Staphylococcus aureus Version: April 2024, Size: 14GB (compressed 188,965,356 proteins in 68,299 genomes to 2,702,458 protein variants)
  5. C. difficile Version: July 2024, Size: 3.8GB (compressed 55,048,119 proteins in 14,186 genomes to 617,095 protein variants)

Basic Mode:

  1. Salmonella enterica Enterobase Version: 08/29/2019 (compressed 975,262,506 proteins in 216,642 genomes to 5,056,335 protein variants)
  2. Pseudomonas aeruginosa Version: June 2024, Size: 19GB (compressed 198,278,793 proteins in 31,832 genomes to 3,537,663 protein variants)
  3. Klebsiella pneumoniae Version: June 2024, Size: 37GB (compressed 405,201,811 proteins in 75,246 genomes to 4,425,185 protein variants)
  4. Escherichia coli Version: March 2024, Size: 90 GB (compressed 1,044,408,936 proteins in 211,942 genomes to 15,220,801 protein variants)

Hashed Databases:

Note: Metadata (i.e. number of genomes, protein variants, etc) is the same as above for each of the following species.

  1. Escherichia coli Size: 7.5GB
  2. Pseudomonas aeruginosa Size: 1.4GB
  3. Klebsiella pneumoniae Size: 2.6GB
  4. RefSeq Version: July 2023, Size: 27 GB (compressed 1,166,846,405 proteins in 306,326 genomes to 229,663,320 protein variants)

The databases can be downloaded by visiting the link or by using the wget command. Examples of the wget command follow:

S. aureus Ortholog

Mycobacterium tuberculosis Ortholog

wget -O TB.zip https://www.dropbox.com/sh/8nqowtd4fcf7dgs/AAAdXiqcxTsEqfIAyNE9TWwRa?dl=0
unzip TB.zip -d WhatsGNU_TB_Ortholog

Pseudomonas aeruginosa Ortholog

wget -O Pa.zip https://www.dropbox.com/sh/r0wvoig3alsz7xg/AABPoNu6FdN7zG2PP9BFezQYa?dl=0
unzip Pa.zip -d WhatsGNU_Pa_Ortholog

S. enterica Enterobase

wget -O Senterica_Enterobase_basic_216642.pickle https://www.dropbox.com/s/gbjengikpynxo12/Senterica_Enterobase_basic_216642.pickle?dl=0

Klebsiella pneumoniae hashed

wget https://zenodo.org/records/13384718/files/Kp_basic.tar.gz
tar xvfz Kp_basic.tar.gz

WhatsGNU toolbox

  1. WhatsGNU_get_GenBank_genomes.py

    This script downloads genomic fna files or protein faa files from GenBank.

  2. WhatsGNU_database_customizer.py

    This script customizes the protein faa files from GenBank, RefSeq, Prokka and RAST by adding a strain name to the start of each protein. This script can also customize the strain names for gff file to be used in Roary for pangenome analysis, if the Ortholog mode is going to be used in WhatsGNU.

  3. WhatsGNU_db_download.py

    This script will download databases for WhatsGNU. You can check all databases available for WhatsGNU in the file databases_available.csv.

  4. WhatsGNU_main.py

    In basic mode, this script ranks protein sequences based on the number of observed exact protein matches (the GNU score) in all known genomes of a particular species. It generates a report for all the proteins in your query in seconds using an exact-match compression technique. In ortholog mode, the script additionally links the different alleles of an ortholog group using the clustered proteins output file from Roary or similar pangenome analysis tools.

    In ortholog mode, WhatsGNU also calculates the Ortholog Variant Rarity Index (OVRI) (scale 0-1). For a given allele, this metric is the sum of the GNU scores of all alleles in the orthologous group whose GNU score is less than or equal to that allele's GNU score, divided by the sum of all GNU scores in the orthologous group. The index represents how unusual a given GNU score is within an ortholog group by measuring how much of the group is made up of alleles with that GNU score or lower. For instance, an allele of GNU=8 in an ortholog group that has 6 alleles with this distribution of GNU scores [300,20,15,8,2,1] will get an OVRI of (8+2+1)/346 = 0.03. On the other hand, the allele with GNU=300 will get an OVRI of (300+20+15+8+2+1)/346 = 1.

    An allele with an OVRI of 1 is relatively common regardless of the magnitude of the GNU score, while an allele with an OVRI of 0.03 is relatively rare. This index helps distinguish between ortholog groups with high levels of diversity and ortholog groups that are highly conserved.

  5. WhatsGNU_plotter.py

    This script plots:

    • Heatmap of GNU scores of orthologous genes in different isolates.
    • Metadata distribution bar plot of proteins.
    • Histogram of the GNU scores of all proteins in a genome.
    • Volcano plot showing proteins with a lower average GNU score in one group (case) compared to the other (control). The x-axis is the delta average GNU score (Average_GNU_score_case - Average_GNU_score_control) in the ortholog group. A lower average GNU score in cases gives a negative value on the x-axis (red dots), while a lower average GNU score in the control group gives a positive value on the x-axis (green dots). The y-axis can be drawn as -log10(P value) from the Mann-Whitney-Wilcoxon test. In this case, points with a lower average GNU score in one group (upper left for case or upper right for control) are of interest when the P value is significant (-log10(P value) > 1.3, which corresponds to P < 0.05). The y-axis can also be the average OVRI in the case group for negative values on the x-axis or the average OVRI in the control group for positive values on the x-axis.
  6. WhatsGNU_main_hashes.py

    This script is compatible only with the hashed versions of the databases. Each hashed database comes with a CSV file that is required to run this script. The corresponding CSV for each hashed database can be found in the respective gzipped tarball. Functions available in this version of the script include generating a basic WhatsGNU report (see below for formatting of report), creating a file of each protein with all associated ids from the database (-i) and creating a file with the top genomes (-t/-tn). With this script, you cannot run blastp on the proteins with GNU score of zero (i.e. the -b, --blastp option is not available with this script) at this time.
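
The OVRI arithmetic described for WhatsGNU_main.py above can be checked with a short Python sketch. It follows the worked example (sum of GNU scores at or below the allele's score, divided by the group total); the function name `ovri` is hypothetical and this is not the code WhatsGNU itself runs.

```python
def ovri(group_gnu_scores, allele_gnu):
    """Ortholog Variant Rarity Index for one allele of an ortholog group.

    group_gnu_scores: GNU scores of all alleles in the group.
    allele_gnu: GNU score of the allele being assessed.
    """
    total = sum(group_gnu_scores)
    if total == 0:
        raise ValueError("the group has no observed sequences")
    at_or_below = sum(s for s in group_gnu_scores if s <= allele_gnu)
    return at_or_below / total


scores = [300, 20, 15, 8, 2, 1]
print(round(ovri(scores, 8), 2))  # 0.03
print(ovri(scores, 300))          # 1.0
```

The most frequent allele of any group always scores 1, so the index is comparable across groups of very different sizes.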

Usage for WhatsGNU_db_download.py

Input

  1. database name (e.g. Sau, Kp, TB, Pa, Staphopia, S.enterica or all)
    WhatsGNU_db_download.py Sau
    

Usage for WhatsGNU_main.py

Input

  1. database (precompressed (.pickle or .txt) or raw (.faa)).
  2. Query protein FASTA file (.faa) or folder of query files.

Optional for S. aureus: The CSV file of Metadata (CC/ST) frequencies for the S. aureus database.

Use precompressed databases

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog query.faa
or
WhatsGNU_main.py -d Senterica_Enterobase_basic_216642.pickle -dm basic query.faa

You can also use a folder of multiple .faa query files as input (e.g. folder_faa/ has all .faa files to be processed)

WhatsGNU_main.py -d TB_Ortholog_6563.pickle -dm ortholog folder_faa/

Use precompressed databases with more features

You can assign output folder name using -o instead of default (WhatsGNU_results_timestamp)

WhatsGNU_main.py -d Sau_Staphopia_basic_43914.pickle -dm basic -o output_results_folder query.faa

Create a file of each protein with all associated ids from the database (Note: large file (~ 1 Gb for 3000 proteins))

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -i -o output_results_folder query.faa

Create a file of top 10 genomes with hits

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -t query.faa

Check how many hits you get from a particular genome in the database (It has to be used with -t). The names of the different strains in the databases and their corresponding Genbank strain name and GCA number are available from List of Genomes included

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -t -s FDAARGOS_31_GCA_001019015.2_CC8_ query.faa

Get Metadata (CC/ST) composition of your hits in the report (Only for S. aureus and you will need to use the metadata_frequencies.csv file (available to download with the database) with -e)

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -e metadata_frequencies.csv query.faa

Get a fasta (.faa) file of all proteins with GNU score of zero.

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -f query.faa

The following options work with -dm ortholog

Run blastp on the proteins with GNU score of zero and modify the report with ortholog information.

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -b query.faa

Note: If -b is used, WhatsGNU will search for compressed_db_orthologs.faa and compressed_db_orthologs_info.txt in the same path for the compressed database as they are needed for the blastp run.

Get the output report of blastp run (works with -b).

WhatsGNU_main.py -d Pa_Ortholog_4713.pickle -dm ortholog -b -op query.faa

Select blastp percent identity (-w) and percent coverage (-c) cutoff values [Default 80 for each], range(0,100).

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -b -w 90 -c 50 query.faa

Select an OVRI cutoff value [Default 0.045], range (0-1).

WhatsGNU_main.py -d TB_Ortholog_6563.pickle -dm ortholog -ri 0.09 query.faa

Use all features together

WhatsGNU_main.py -d Sau_Ortholog_10350.pickle -dm ortholog -o output_results_folder -i -t -s strain_name -e metadata_frequencies.csv -f -b -op -w 95 -c 40 -ri 0.09 query.faa

Command line options

WhatsGNU_main.py -h
usage: WhatsGNU_main.py [-h] [-m MKDATABASE | -d DATABASE] [-a] [-j]
                        [-r [ROARY_CLUSTERED_PROTEINS]] [-dm {ortholog,basic}]
                        [-ri [RARITY_INDEX]] [-o OUTPUT_FOLDER] [--force]
                        [-p PREFIX] [-t] [-s STRAINHITS] [-e METADATA] [-i]
                        [-f] [-b] [-op] [-w [PERCENT_IDENTITY]]
                        [-c [PERCENT_COVERAGE]] [-q] [-v]
                        query_faa

WhatsGNU v1.0 utilizes the natural variation in public databases to rank
protein sequences based on the number of observed exact protein matches
(the GNU score) in all known genomes of a particular species. It generates a
report for all the proteins in your query in seconds.

positional arguments:
  query_faa             Query protein FASTA file/s to analyze (.faa)

optional arguments:
  -h, --help            show this help message and exit
  -m MKDATABASE, --mkdatabase MKDATABASE
                        you have to provide path to faa file or a folder of
                        multiple faa files for compression
  -d DATABASE, --database DATABASE
                        you have to provide path to your compressed database
  -a, --pickle          Save database in pickle format [Default only txt file]
  -j, --sql             Save database in SQL format for large Databases
                        [Default only txt file]
  -r [ROARY_CLUSTERED_PROTEINS], --roary_clustered_proteins [ROARY_CLUSTERED_PROTEINS]
                        clustered_proteins output file from roary to be used
                        with -m
  -dm {ortholog,basic}, --database_mode {ortholog,basic}
                        select a mode from 'ortholog' or 'basic' to be used
                        with -d
  -ri [RARITY_INDEX], --rarity_index [RARITY_INDEX]
                        select an ortholog variant rarity index (OVRI) cutoff
                        value in range (0-1)[0.045] for ortholog mode
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Database output prefix to be created for results
                        (default: timestamped WhatsGNU_results in the current
                        directory)
  --force               Force overwriting existing results folder assigned
                        with -o (default: off)
  -p PREFIX, --prefix PREFIX
                        Prefix for output compressed database (default:
                        WhatsGNU_compressed_database)
  -t, --topgenomes      create a file of top 10 genomes with hits
  -s STRAINHITS, --strainhits STRAINHITS
                        check how many hits you get from a particular
                        strain,it has to be used with -t
  -e METADATA, --metadata METADATA
                        get the metadata composition of your hits, use the
                        metadata_frequency.csv file produced by the WhatsGNU
                        customizer script
  -i, --ids_hits        create a file of each protein with locus_tags (ids) of
                        all hits from the database, large file (~ 1 Gb for
                        3000 pts)
  -f, --faa_GNU_0       get a fasta (.faa) file of all proteins with GNU score
                        of zero
  -b, --blastp          run blastp on the proteins with GNU score of zero and
                        modify the report with ortholog_info, blastp has to be
                        installed
  -op, --output_blastp  get the output report of blastp run, it has to be used
                        with -b
  -w [PERCENT_IDENTITY], --percent_identity [PERCENT_IDENTITY]
                        select a blastp percent identity cutoff value [80],
                        range(0,100)
  -c [PERCENT_COVERAGE], --percent_coverage [PERCENT_COVERAGE]
                        select a blastp percent coverage cutoff value [80],
                        range(0,100)
  -q, --quiet           No screen output [default OFF]
  -v, --version         print version and exit

Output

Always with -m or -d

query_WhatsGNU_report_v1.txt (tab-separated output file)

Basic Mode

protein | GNU score | length | function | sequence
------- | --------- | ------ | -------- | --------
strain_x_protein_1 | 2 | 3 | argG | MVM
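
Because the report is tab-separated, it can be post-processed with the Python standard library. The sketch below lists the rarest proteins in a report. It assumes the header row uses exactly the column names shown above (`protein`, `GNU score`); the sample rows and the function name `rarest_proteins` are hypothetical, and an in-memory string stands in for a real report file.

```python
import csv
import io

# Stand-in for the contents of a basic-mode report (hypothetical rows).
report_text = (
    "protein\tGNU score\tlength\tfunction\tsequence\n"
    "strain_x_protein_1\t2\t3\targG\tMVM\n"
    "strain_x_protein_2\t0\t4\thypothetical protein\tMKLV\n"
    "strain_x_protein_3\t150\t3\trpoB\tMSQ\n"
)


def rarest_proteins(handle, max_gnu):
    """Return (protein, GNU score) pairs with GNU score <= max_gnu, rarest first."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [(row["protein"], int(row["GNU score"])) for row in reader]
    return sorted((h for h in hits if h[1] <= max_gnu), key=lambda h: h[1])


print(rarest_proteins(io.StringIO(report_text), max_gnu=10))
# [('strain_x_protein_2', 0), ('strain_x_protein_1', 2)]
```

To read a real report, the `io.StringIO` object would be replaced by an open file handle.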

Ortholog Mode (in addition to the previous five columns)
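ortholog_group | ortho_gp_total_sequences_number | ortho_gp_total_variants_number | minimum_GNU | maximum_GNU | average_GNU | OVRI | OVRI interpretation
-------------- | ------------------------------- | ------------------------------ | ----------- | ----------- | ----------- | ---- | -------------------
argG | 100 | 5 | 2 | 50 | 38 | 0.02 | rare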

Explanation for the columns in the report: For instance, if strain_x_protein_1 (sequence: MVM) belongs to argG orthologous group which has 5 protein variants (MMMM,MVVM, MVM, MVV and VVM) with GNU scores [50,35,10,3,2]:
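
The values in the example ortholog-mode row can be reproduced from these five GNU scores with a few lines of Python. This is a sketch built on inferences from the numbers, not a statement of how WhatsGNU computes them: the total of 100 matches the sum of the GNU scores, the average of 38 matches a mean weighted by how often each variant occurs (3838/100), and the OVRI of 0.02 matches an allele with GNU score 2. Comparing against the default OVRI cutoff of 0.045 with a strict "less than" is also an assumption.

```python
gnu_scores = [50, 35, 10, 3, 2]  # one GNU score per protein variant in the group
query_gnu = 2                    # GNU score of the allele in the example row

total_sequences = sum(gnu_scores)   # ortho_gp_total_sequences_number
total_variants = len(gnu_scores)    # ortho_gp_total_variants_number
minimum_gnu = min(gnu_scores)
maximum_gnu = max(gnu_scores)

# Assumption: average_GNU weights each variant by the number of sequences
# that carry it, so a variant seen 50 times counts 50 times.
average_gnu = sum(s * s for s in gnu_scores) / total_sequences

# OVRI: GNU scores at or below the query allele's score over the group total.
ovri = sum(s for s in gnu_scores if s <= query_gnu) / total_sequences
is_rare = ovri < 0.045  # default cutoff from the -ri option

print(total_sequences, total_variants, minimum_gnu, maximum_gnu)  # 100 5 2 50
print(int(average_gnu), ovri, is_rare)                            # 38 0.02 True
```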

Note: If -e option is used for S. aureus, CC/ST percentages’ columns will be added to the report.

WhatsGNU_date_time.log (Log file, e.g. WhatsGNU_v1_20190209_183406.log)

Always with -m

Optional

Option | File | Description
------ | ---- | -----------
-i | query_WhatsGNU_hits.txt | each protein with all hits_ids from the database, large file (~ 1 Gb for S. aureus)
-t | query_WhatsGNU_topgenomes.txt | top 10 genomes with hits to your query
-f | query_WhatsGNU_zeros.faa | file of all proteins with GNU score of zero
-op | query_WhatsGNU_zeros_blast_report.txt | output report of blastp run

Usage for WhatsGNU_plotter.py

Input

A folder of query_WhatsGNU_report.txt files.

Heatmap

Plot a heatmap of GNU scores for the proteins in proteins.faa, with strains ordered as listed in strains_order.txt (-d). Assign a title using -t. Font size and figure size (w,h) are given by -f and -fs, respectively. Annotate the heatmap cells with the OVRI rare tag using the -r option.

WhatsGNU_plotter.py -hp ortholog -q proteins.faa -d strains_order.txt -t title -r -f 14 -fs 14 10 prefix_name WhatsGNU_reports_folder/

Metadata percentage distribution

Plot a metadata percentage bar plot for the GNU scores of the proteins in proteins.faa for each WhatsGNU report.

WhatsGNU_plotter.py -mb basic -q proteins.faa prefix_name WhatsGNU_reports_folder/

Histogram

Plot a blue histogram of the GNU scores for each WhatsGNU report using 100 bins and get a text file showing novel and conserved proteins with -p option to assign cutoffs.

WhatsGNU_plotter.py -x -e blue -b 100 -p 50 5000 prefix_name WhatsGNU_reports_folder/
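
The effect of the two -p cutoffs can be pictured with a small Python sketch. It is illustrative only: the function name `classify_gnu` and the label `intermediate` are hypothetical, and treating both limits as inclusive is an assumption rather than documented behaviour of the plotter.

```python
def classify_gnu(gnu_score, novel_upper=10, conserved_lower=100):
    """Label a GNU score as novel, conserved, or neither.

    The defaults mirror the [Default 10, 100] shown in the plotter help.
    """
    if gnu_score <= novel_upper:
        return "novel"
    if gnu_score >= conserved_lower:
        return "conserved"
    return "intermediate"  # hypothetical label for scores between the cutoffs


# Cutoffs from the example command: -p 50 5000
scores = [0, 12, 50, 800, 5000, 9000]
labels = [classify_gnu(s, novel_upper=50, conserved_lower=5000) for s in scores]
print(labels)
# ['novel', 'novel', 'novel', 'intermediate', 'conserved', 'conserved']
```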

Volcano plot

Plot two scatterplots that show either statistical significance (P value) or average OVRI versus magnitude of change (Delta_average_GNU_Score). The case/control tag for each isolate is provided in isolates_case_control_tag.csv. The option -c 100 sets the percentage of isolates a protein must be present in to be included. A summary statistics file is also created.

WhatsGNU_plotter.py -st isolates_case_control_tag.csv -c 100 prefix_name WhatsGNU_reports_folder/
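
The two coordinates of a point on the P value version of the volcano plot can be sketched in Python as follows. The function name `volcano_point` and the example GNU scores are hypothetical, and the P value is passed in rather than computed; in practice it could come from a Mann-Whitney U test such as `scipy.stats.mannwhitneyu`, which is not called here to keep the example dependency-free.

```python
import math
from statistics import mean


def volcano_point(case_gnu, control_gnu, p_value):
    """Return (delta average GNU score, -log10 P) for one ortholog group."""
    delta = mean(case_gnu) - mean(control_gnu)  # negative: lower GNU in cases
    return delta, -math.log10(p_value)


# Hypothetical GNU scores of one ortholog group in three case and three control isolates.
x, y = volcano_point([5, 10, 15], [100, 120, 110], p_value=0.01)
significant = y > 1.3  # threshold quoted for the plot, i.e. P < 0.05

print(x)            # -100
print(significant)  # True
```

A point like this one would sit in the upper left of the plot: lower average GNU score in the case group, with a significant P value.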

All features together

WhatsGNU_plotter.py -hp ortholog -q proteins.faa -d strains_order.txt -t title -r -f 16 -fs 14 10 -mb ortholog -x -e blue -b 100 -st isolates_case_control_tag.csv -c 100 prefix_name WhatsGNU_reports_folder/

Command line options

WhatsGNU_plotter.py -h
usage: WhatsGNU_plotter.py [-h] [-hp {ortholog,basic}] [-l LIST_GENES]
                           [-q FASTA] [-op] [-d STRAINS_ORDER] [-r]
                           [-rc RARITY_COLOR] [-fs FIGURE_SIZE FIGURE_SIZE]
                           [-hc HEATMAP_COLOR] [-mc MASKED_COLOR]
                           [-f FONT_SIZE] [-t TITLE] [-mb {ortholog,basic}]
                           [-w] [-s SELECT_METADATA] [-x] [-e HISTOGRAM_COLOR]
                           [-b HISTOGRAM_BINS]
                           [-p NOVEL_CONSERVED NOVEL_CONSERVED]
                           [-st STRAINS_TAG_VOLCANO] [-c CUTOFF_VOLCANO]
                           [-cc CASE_CONTROL_NAME CASE_CONTROL_NAME]
                           prefix_name directory_path

WhatsGNU_plotter script for WhatsGNU v1.0.

positional arguments:
  prefix_name           prefix name for the the output folder and
                        heatmap/volcano output files
  directory_path        path to directory of WhatsGNU reports

optional arguments:
  -h, --help            show this help message and exit
  -hp {ortholog,basic}, --heatmap {ortholog,basic}
                        heatmap of GNU scores for orthologous genes in
                        multiple isolates
  -l LIST_GENES, --list_genes LIST_GENES
                        a txt file of ortholog group names from one of the
                        WhatsGNU reports for heatmap
  -q FASTA, --fasta FASTA
                        a FASTA file of sequences for the proteins of interest
                        for heatmap or metadata barplot
  -op, --output_blastp  get the output report of blastp run, it has to be used
                        with -q
  -d STRAINS_ORDER, --strains_order STRAINS_ORDER
                        list of strains order for heatmap
  -r, --rarity          Annotate heatmap cells with OVRI(default: off)
  -rc RARITY_COLOR, --rarity_color RARITY_COLOR
                        OVRI data text color in the heatmap
  -fs FIGURE_SIZE FIGURE_SIZE, --figure_size FIGURE_SIZE FIGURE_SIZE
                        heatmap width and height in inches w,h, respectively
  -hc HEATMAP_COLOR, --heatmap_color HEATMAP_COLOR
                        heatmap color
  -mc MASKED_COLOR, --masked_color MASKED_COLOR
                        missing data color in heatmap
  -f FONT_SIZE, --font_size FONT_SIZE
                        heatmap font size
  -t TITLE, --title TITLE
                        title for the heatmap [Default:WhatsGNU heatmap]
  -mb {ortholog,basic}, --metadata_barplot {ortholog,basic}
                        Metadata percentage distribution for proteins in a
                        FASTA file
  -w, --all_metadata    all metadata
  -s SELECT_METADATA, --select_metadata SELECT_METADATA
                        select some metadata
  -x, --histogram       histogram of GNU scores
  -e HISTOGRAM_COLOR, --histogram_color HISTOGRAM_COLOR
                        histogram color
  -b HISTOGRAM_BINS, --histogram_bins HISTOGRAM_BINS
                        number of bins for the histograms [10]
  -p NOVEL_CONSERVED NOVEL_CONSERVED, --novel_conserved NOVEL_CONSERVED NOVEL_CONSERVED
                        upper and lower GNU score limits for novel and
                        conserved proteins novel_GNU_upper_limit,
                        conserved_GNU_lower_limit, respectively [Default 10,
                        100]
  -st STRAINS_TAG_VOLCANO, --strains_tag_volcano STRAINS_TAG_VOLCANO
                        a csv file of the strains of the two groups to be
                        compared with (case/control) tag
  -c CUTOFF_VOLCANO, --cutoff_volcano CUTOFF_VOLCANO
                        a percentage of isolates a protein must be in [Default:
                        100]
  -cc CASE_CONTROL_NAME CASE_CONTROL_NAME, --case_control_name CASE_CONTROL_NAME CASE_CONTROL_NAME
                        case and control groups' names [Default: case control]

Output

A heatmap, metadata percentage distribution bar plot, histogram and two volcano plots and summary statistics files.

Instructions for creating a database

Simple (GenBank)

  1. Download proteomes of a species (.faa) in a Directory from GenBank
    WhatsGNU_get_GenBank_genomes.py -f GCAs.txt Species_faa
    
  2. Modify the faa files to have the strains’ names
    WhatsGNU_database_customizer.py -c -g Species_modified Species_faa/
    
  3. Run WhatsGNU_main.py in basic mode
    WhatsGNU_main.py -m Species_modified_concatenated.faa query.faa
    

Simple (Prokka-annotated faa files)

  1. Annotate your genomes with Prokka and put all faa files in one folder
  2. Modify the faa files to have the strains’ names
    WhatsGNU_database_customizer.py -c -p Species_modified Species_faa/

  3. Run WhatsGNU_main.py in basic mode
    WhatsGNU_main.py -m Species_modified_concatenated.faa query.faa


    query.faa can be any faa file; its content does not matter at this step.

Advanced (e.g. S. aureus)

  1. Download genomes of a species (.fna) in a Directory from GenBank
    WhatsGNU_get_GenBank_genomes.py -c GCAs.txt Sau_fna
    gunzip Sau_fna/*

  2. Annotate the genomes using Prokka. An example command for S. aureus is given; change it or use any other options from Prokka.
    for i in `cat file_names.list`;do prokka --kingdom Bacteria --outdir prokka_$i --gcode 11 --genus Staphylococcus --species aureus --strain $i --prefix $i --locustag $i Sau_fna/$i*.fna; done
    find ./ -name '*.faa' -exec cp -prv '{}' '/Sau_faa/' ';'
    find ./ -name '*.gff' -exec cp -prv '{}' '/Sau_gff/' ';'

  3. Modify the faa and gff files to have the strains’ names
    WhatsGNU_database_customizer.py -c -p -l strain_name_list.csv Sau_modified_faa Sau_faa/
    WhatsGNU_database_customizer.py -i -s -l strain_name_list.csv -g Sau_modified_gff Sau_gff/


    The strain_name_list.csv is a comma-separated list of 3+ columns: file_name, old locustag, new locustag and optionally metadata. If metadata are provided, the script will concatenate the new locustag with the metadata using ‘_’ as a separator. The new locustag in this case will be: new_locustag_metadata. In case of GenBank, RefSeq and RAST, use NA for the old locustag column in the list.csv file.

  4. Run Roary for pangenome analysis. An example command for Roary is given; change it or use any other options from Roary.
    roary Sau_modified_gff/*.gff

  5. Run WhatsGNU_main.py in ortholog mode using the clustered_proteins output file from Roary
    WhatsGNU_main.py -m Sau_modified_concatenated.faa -r clustered_proteins query.faa
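
The strain_name_list.csv described in the steps above can be drafted with Python's csv module. This is a sketch under assumptions: the file names, locus tags and metadata values are made up, the file is written without a header row (the text does not say whether one is expected), and `final_locustag` is a hypothetical helper that only mirrors the described new_locustag_metadata naming.

```python
import csv
import io


def final_locustag(new_locustag, metadata=""):
    """Join the new locus tag and optional metadata with an underscore."""
    return f"{new_locustag}_{metadata}" if metadata else new_locustag


# Columns: file_name, old locustag, new locustag, optional metadata.
rows = [
    ("GCA_000000000.1.faa", "NA", "strainA", "CC8"),  # GenBank file: old tag is NA
    ("strainB.faa", "OLDTAG", "strainB", ""),         # Prokka file, no metadata
]

buffer = io.StringIO()  # stands in for open("strain_name_list.csv", "w", newline="")
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue(), end="")

print([final_locustag(new, meta) for _, _, new, meta in rows])
# ['strainA_CC8', 'strainB']
```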
    

Command line options for WhatsGNU_get_GenBank_genomes.py

WhatsGNU_get_GenBank_genomes.py -h
usage: WhatsGNU_get_GenBank_assemblies.py [-h] [-f] [-c] [-r] list output_folder

Get GenBank assemblies (faa or/and fna) for WhatsGNU v1.0

positional arguments:
  list           a list.txt file of GenBank accession numbers (GCA#.#)
  output_folder  give name for output folder to be created

optional arguments:
  -h, --help     show this help message and exit
  -f, --faa      protein faa file from GenBank
  -c, --contigs  genomic fna file from GenBank
  -r, --remove   remove assembly_summary_genbank.txt after done

Command line options for WhatsGNU_database_customizer.py

WhatsGNU_database_customizer.py -h
usage: WhatsGNU_database_customizer.py [-h] [-g | -p | -r | -s] [-z]
                                       [-l LIST_CSV] [-i] [-c]
                                       prefix_name directory_path

Database_customizer script for WhatsGNU v1.0.

positional arguments:
  prefix_name           prefix name for the output folder and the one
                        concatenated modified file
  directory_path        path to directory of faa, RAST txt or gff files

optional arguments:
  -h, --help            show this help message and exit
  -g, --GenBank_RefSeq  faa files from GenBank or RefSeq
  -p, --prokka          faa files from Prokka
  -r, --RAST            spreadsheet tab-separated text files from RAST
  -s, --gff_file        gff file from prokka, needed if planning to run Roary
  -z, --gzipped         compressed file (.gz)
  -l LIST_CSV, --list_csv LIST_CSV
                        a file.csv of 3+ columns: file_name, old locustag, new
                        locustag and optionally metadata
  -i, --individual_files
                        individual modified files
  -c, --concatenated_file
                        one concatenated modified file of all input files

Example usage for WhatsGNU_main_hashes.py

Use the hashed database to generate basic WhatsGNU reports

WhatsGNU_main_hashes.py -d Kp_basic_db_hashed_str.pickle -csv Kp_basic_db_hashed.csv -o WhatsGNU_Kp_op faa/

Find the top 10 genomes closest to your genomes of interest

WhatsGNU_main_hashes.py -d PA_basic_db_hashed_str.pickle -csv PA_basic_db_hashed.csv -t -o WhatsGNU_PA_op faa/

By default, when using -i/--ids_hits, the output file reports the hashed values of the hits. To get the accession names instead, use the --accession_names option

WhatsGNU_main_hashes.py -d basic_Ecoli_db_hashed_str.pickle -csv basic_Ecoli_db_hashed.csv -i --accession_names -o WhatsGNU_Ecoli_op faa/

Command line options for WhatsGNU_main_hashes.py

usage: WhatsGNU_main.py [-h] [-d DATABASE] [-o OUTPUT_FOLDER] [--force]
                        [-p PREFIX] [-t] [-csv CSV] [-tn TOPGENOMES_COUNT]
                        [-s STRAINHITS] [-i] [--accession_names]
                        [--hash_values] [-q] [-v]
                        query_faa

WhatsGNU v1.4 utilizes the natural variation in public databases to rank
protein sequences based on the number of observed exact protein matches
(the GNU score) in all known genomes of a particular species. It generates a
report for all the proteins in your query in seconds.

positional arguments:
  query_faa             Query protein FASTA file/s to analyze (.faa)

options:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        you have to provide path to your compressed database
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Database output prefix to be created for results
                        (default: timestamped WhatsGNU_results in the current
                        directory)
  --force               Force overwriting existing results folder assigned
                        with -o (default: off)
  -p PREFIX, --prefix PREFIX
                        Prefix for output compressed database (default:
                        WhatsGNU_compressed_database)
  -t, --topgenomes      create a file of top N genomes with most number of
                        exact matches to query [Default top 10 genomes]
  -csv CSV              csv file of hashed inputs
  -tn TOPGENOMES_COUNT, --topgenomes_count TOPGENOMES_COUNT
                        select number of closest top genomes to show [Default
                        top 10 genomes]
  -s STRAINHITS, --strainhits STRAINHITS
                        check how many hits you get from a particular
                        strain,it has to be used with -t
  -i, --ids_hits        create a file of each protein with locus_tags (ids) of
                        all hits from the database, large file (~ 1 Gb for
                        3000 pts)
  --accession_names     to be used with --ids_hits. If this option is
                        selected, writes the id_hits file with the accession
                        names.
  --hash_values         to be used with --ids_hits. Default option. This
                        options writes the id_hits file with the hashed
                        values.
  -q, --quiet           No screen output [default OFF]
  -v, --version         print version and exit

Requests for creating a database

Requests to process a database for a specific species are welcome and will be considered.

Bugs

Please submit via the GitHub issues page: https://github.com/ahmedmagds/WhatsGNU/issues

Software Licence

GPLv3: https://github.com/ahmedmagds/WhatsGNU/blob/master/LICENSE

Citations

WhatsGNU

WhatsGNU: a tool for identifying proteomic novelty
Moustafa AM and Planet PJ 2020, Genome Biology;21:58

Other tools