Process

MRF process

1a: UP000074855_berghei.fasta

1b: result_UP000074855_berghei.csv

2a: A list of proteins from the MRF output; should include the length and description of each protein

3a: annoted_results.csv (annotation file produced with pfsearchV3 and a dataset of profiles for each TR)

4a: details_cluster_report.csv

4b: summary_cluster_report.csv

4c: top_clusters.png

5a: aacompresult_UP000074855_berghei.csv; aacompresult_UP000074855_berghei_by_group.csv

6a: Generate a column in the MRF output file with the length group of each TR

7a: data_composition.csv (from Stefany’s script)

SCRIPTS

1: Run MRF for all the proteins in a proteome fasta file

2: Obtain all protein IDs with their description and length, without redundancy

3: execute_annotations.py

4: percent_consensus_dbcscan.py

5: get_aa_composition_regionsH3.py

6: split_mrf_by_groupH1.py

7: TablesTR_fullSteps.py
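
Step 2 has no script name yet. A minimal sketch of it, assuming the IDs, descriptions and lengths are taken from the proteome FASTA file and that headers follow the UniProt layout ">db|ACCESSION|ENTRY_NAME description". The function names, the `wanted_ids` parameter (meant to hold the protein IDs found in the MRF output) and the output column names are all hypothetical, not part of the existing scripts:

```python
import csv
import io


def read_protein_table(fasta_handle, wanted_ids=None):
    """Return one (protein_id, description, length) tuple per unique protein ID.

    wanted_ids: optional set of IDs to keep (hypothetical: e.g. the IDs
    present in the MRF output). None keeps every protein in the FASTA.
    """
    rows = {}       # protein_id -> [description, length]; keeps first-seen order
    current = None  # ID of the record being read, or None if it is skipped
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            token, _, description = line[1:].partition(" ")
            parts = token.split("|")
            # Assumed UniProt-style header; fall back to the whole first token.
            protein_id = parts[1] if len(parts) >= 3 else token
            skip = protein_id in rows or (
                wanted_ids is not None and protein_id not in wanted_ids
            )
            if skip:
                current = None  # duplicate or unwanted: ignore its sequence
            else:
                rows[protein_id] = [description, 0]
                current = protein_id
        elif current is not None:
            rows[current][1] += len(line)  # sequences may span several lines
    return [(pid, desc, length) for pid, (desc, length) in rows.items()]


def write_protein_table(rows, out_handle):
    """Write the table as CSV with a header row (column names are assumed)."""
    writer = csv.writer(out_handle)
    writer.writerow(["protein_id", "description", "length"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up records, only to show the call pattern.
    demo = io.StringIO(">tr|ID1|ID1_EXAMPLE first protein\nMKV\nLLA\n")
    write_protein_table(read_protein_table(demo), io.StringIO())
```

On the real data the handle would come from open("UP000074855_berghei.fasta"). Redundancy is resolved here by keeping the first record seen for each ID; whether that is the right rule for this project is an open choice.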

TAPASS process

2a: A list of proteins from the MRF output; should include the length and description of each protein, and possibly the sequence from the MRF process.

8a: Result_TAPASS_berghei.csv

10a: Cath_annotation.csv; pfam_annotation.csv; Slims_annotation.csv

11a: merge_complete.csv

12a: Cath_annotation_outsideTR.csv; pfam_annotation_outsideTR.csv; Slims_annotation_outsideTR.csv

DESCRIPTIONS

8: Executes TAPASS on the chosen set

9: Get all the TAPASS lines that correspond to one protein and see which of them overlap a repeat region from MRF. Then, for each repeat region, analyze all the resulting predicted lines from TAPASS using the threshold rules.

10: If the threshold is reached, save the corresponding predicted TAPASS lines in the 10a file.

11: Create a merged file (11a) in which the user can select all or some of the following for the MRF repeat regions of a chosen set: transmembrane, disorder, functional domain, SLIMs, structural domain, and amyloidogenicity_AR.

12: If there are lines that do not overlap a repeat region but whose prediction tool is PFAM, CATH, ELM, or SignalP, save all the corresponding TAPASS information in the 12a file.
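
The overlap logic of steps 9, 10 and 12 could be sketched roughly as below. The threshold rules are not written down in these notes, so the rule used here (a prediction is kept when it covers at least `min_fraction` of a repeat region) is only a placeholder assumption. The `Region` type, the function names and the exact spelling of the tool labels are also hypothetical; coordinates are assumed to be 1-based and inclusive.

```python
from typing import NamedTuple


class Region(NamedTuple):
    protein_id: str
    start: int      # assumed 1-based, inclusive
    end: int        # assumed 1-based, inclusive
    tool: str = ""  # prediction tool; left empty for MRF repeat regions


def overlap_length(a, b):
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


# Tools whose non-overlapping predictions are still saved (step 12).
OUTSIDE_TOOLS = {"PFAM", "CATH", "ELM", "SignalP"}


def classify(predictions, repeats, min_fraction=0.5):
    """Split predictions into (inside, outside) lists.

    inside:  covers at least min_fraction of some repeat region of the same
             protein (placeholder for the real threshold rules, steps 9-10).
    outside: overlaps no repeat region and comes from OUTSIDE_TOOLS (step 12).
    Predictions that overlap a repeat but miss the threshold are dropped.
    """
    inside, outside = [], []
    for pred in predictions:
        same_protein = [r for r in repeats if r.protein_id == pred.protein_id]
        best = max(
            (overlap_length(pred, r) / (r.end - r.start + 1) for r in same_protein),
            default=0.0,
        )
        if best >= min_fraction:
            inside.append(pred)
        elif best == 0.0 and pred.tool in OUTSIDE_TOOLS:
            outside.append(pred)
    return inside, outside
```

In this sketch the `inside` list corresponds to what would go into the 10a files and the `outside` list to the 12a files. Splitting each list by tool into the separate Cath, pfam and Slims files is left out.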