Documentation

Overview

mimicINTweb provides an easy-to-use interface to the mimicINT workflow, which was designed to infer interactions between microbe proteins of interest and a substantial fraction of the human proteome using known interaction templates (domain-domain and motif-domain).

mimicINT takes as input the FASTA-formatted sequences of microbe proteins (e.g., viral or other pathogen proteins likely to be found at the pathogen-host interface) and detects host-like elements: domains and SLiMs. Domain identification is performed by the InterProScan tool using the domain signatures from the InterPro database. Host-like SLiM detection relies on the motif definitions available in the ELM database and is carried out by the SLiMProb tool from the SLiMSuite software package. As SLiMs are usually located in disordered regions, SLiMProb uses the IUPred algorithm to compute the disorder propensity of each amino acid in the query sequences, and generates an average disorder propensity score for every detected SLiM occurrence.
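To make the SLiM step more concrete, here is a minimal Python sketch of the general idea: scan a sequence with a motif regular expression and average a per-residue disorder profile over each occurrence. It is not SLiMProb or IUPred; the function name, the toy sequence, the motif pattern, and the disorder values are all hypothetical, and the disorder scores are assumed to be supplied by an external predictor.

```python
import re
from statistics import mean


def find_slim_occurrences(sequence, motif_regex, disorder):
    """Return one (start, end, match, mean_disorder) tuple per motif occurrence.

    `disorder` holds one propensity score per residue of `sequence`.
    Coordinates are 0-based and end-exclusive.
    """
    if len(disorder) != len(sequence):
        raise ValueError("expected one disorder score per residue")
    # Wrap the motif in a lookahead so that overlapping occurrences are reported.
    pattern = re.compile(f"(?=({motif_regex}))")
    occurrences = []
    for m in pattern.finditer(sequence):
        start, end = m.span(1)
        if end == start:
            continue  # ignore empty matches
        occurrences.append((start, end, m.group(1), mean(disorder[start:end])))
    return occurrences


# Hypothetical toy input: a made-up sequence, motif pattern, and disorder profile.
toy_sequence = "MAPPLPPRAAK"
toy_disorder = [0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
for hit in find_slim_occurrences(toy_sequence, "P..P", toy_disorder):
    print(hit)
```

The average score can then be used to rank occurrences or to discard those falling in ordered regions; any threshold would be a user choice rather than something fixed by this sketch.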

Subsequently, mimicINT infers the interactions between host and microbe proteins. This analysis takes as input the list of known interaction templates gathered from two resources: (i) the 3did database, a collection of domain-domain interactions extracted from three-dimensional protein structures, and (ii) the ELM database, which provides a list of experimentally identified SLiM-domain interactions in Eukaryotes. The inference checks whether any of the microbe proteins contains at least one domain or SLiM for which an interaction template is available. If so, it infers an interaction between that protein and all the host proteins containing the cognate domain (i.e., the interacting domain in the template). Because motif-binding domains of the same group, such as SH3 or PDZ, show different interaction specificities, we have implemented a previously proposed strategy that takes these differences into account for the SLiM-domain interaction inference. This approach assigns a "domain score" that can be used to rank or filter inferred SLiM-domain interactions (see below).
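The template-based inference described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the mimicINT implementation: the function name and all identifiers (`DOM_A`, `SLIM_X`, `V1`, `H1`, ...) are placeholders, each template is written as a directed pair (microbe-side element, host-side cognate domain), and the domain score is left out.

```python
from collections import defaultdict


def infer_interactions(microbe_elements, host_domains, templates):
    """Infer microbe-host protein pairs from interaction templates.

    microbe_elements: dict mapping microbe protein -> set of domain/SLiM ids
    host_domains:     dict mapping host protein -> set of domain ids
    templates:        iterable of (element_id, cognate_domain_id) pairs
    Returns a sorted list of (microbe, host, element, cognate_domain) tuples.
    """
    # Index host proteins by the domains they contain.
    hosts_by_domain = defaultdict(set)
    for host, domains in host_domains.items():
        for domain in domains:
            hosts_by_domain[domain].add(host)

    inferred = set()
    for microbe, elements in microbe_elements.items():
        for element, cognate in templates:
            if element in elements:
                # Every host protein carrying the cognate domain is a candidate.
                for host in hosts_by_domain.get(cognate, ()):
                    inferred.add((microbe, host, element, cognate))
    return sorted(inferred)


# Placeholder identifiers, for illustration only.
microbe_elements = {"V1": {"SLIM_X", "DOM_A"}, "V2": {"DOM_C"}}
host_domains = {"H1": {"DOM_B"}, "H2": {"DOM_B", "DOM_D"}, "H3": {"DOM_E"}}
templates = [("SLIM_X", "DOM_B"), ("DOM_A", "DOM_D"), ("DOM_C", "DOM_Z")]
for row in infer_interactions(microbe_elements, host_domains, templates):
    print(row)
```

Since domain-domain templates are unordered pairs, a caller using this sketch would list both orientations of each pair.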

In the final step, to identify the host cellular functions potentially targeted by the pathogen proteins, mimicINT runs a functional enrichment analysis of the inferred host interactors. This analysis statistically assesses the over-representation of functional categories, such as Gene Ontology terms and biological pathways (e.g., KEGG and Reactome), using the g:Profiler R client.
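As background on what "over-representation" means statistically, the following Python sketch computes a one-sided hypergeometric p-value for a single functional category. It is only an illustration of the general test: it is not a reimplementation of g:Profiler, it applies no multiple-testing correction, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated proteins among n interactors.

    N: size of the background (e.g., all annotated host proteins)
    K: background proteins annotated with the category
    n: number of inferred interactors
    k: interactors annotated with the category
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the upper tail P(X = i) for i = k .. min(n, K).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Toy numbers: 4 of 10 background proteins are annotated; 2 of 3 interactors are.
print(hypergeom_pvalue(k=2, n=3, K=4, N=10))
```

In a real analysis one p-value is obtained per category, so a correction for multiple testing is needed before calling any term enriched.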

**Figure 1. Overview of the *mimic*INT workflow.**
Starting from a FASTA file of protein sequences of interest (*e.g.*, microbe sequences) (A), *mimic*INT identifies both the domain-mediated (B) and SLiM-mediated (C) interaction interfaces. Using publicly available interaction templates, *mimic*INT infers the interactions between the proteins of the query and target (*i.e.*, host) species (D). Finally, it provides a list of functional annotations that are significantly enriched in the inferred protein targets (E).

Computation of the motif-binding domain scores

To identify motif-binding domains that can be specifically associated with a given ELM motif class, we use the strategy proposed by Weatheritt et al. (2012), which assumes that a domain significantly similar to a known motif-binding domain should also bind the same motif. We first compiled a list of experimentally identified motif-binding domains by combining the original list from Weatheritt et al. with more recent annotations from the ELM database (August 2020). Obsolete ELM class identifiers from Weatheritt et al. were mapped to current ELM identifiers using the "Renamed ELM classes" file (http://elm.eu.org/infos/browse_renamed.tsv), and duplicated domain annotations were removed. In total, we collected 538 domains in 415 human proteins known to bind 212 ELM motif classes (73% of the 290 motif classes present in ELM, August 2020). The sequences of these 415 annotated proteins were fetched from UniProtKB.

We next gathered the sequences of 1452 reference Eukaryota proteomes (22,262,113 protein sequences in total) from UniProtKB (August 2020). We removed redundancy using the CD-HIT algorithm to generate a database of 21,414,544 non-identical sequences. We used the GOPHER tool from the SLiMSuite package to identify orthologous sequences of the annotated proteins in the database of non-identical eukaryotic sequences by reciprocal BLAST best hits.

Selected orthologous proteins were aligned using the multiple sequence alignment algorithm Clustal Omega (version 1.2.4). Once the position of the motif-binding domain was identified within the alignment, we removed aligned domains with indels covering >10% of the annotated domain sequence. We iteratively realigned the sequences until a set of proteins was identified with <10% indel coverage. In total, we selected 701 multiple sequence alignments that were used as input for generating domain-specific HMM profiles with the hmmbuild program from the HMMER package (version 3.1.1).
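The indel filter applied to the aligned domains can be illustrated with a short Python sketch. The exact definition of indel coverage used in the pipeline is not spelled out here, so the sketch makes an explicit assumption: coverage is the number of alignment columns within the domain region where exactly one of the two sequences (reference or ortholog) has a gap, divided by the number of reference residues in that region. The function names and toy sequences are hypothetical, and the iterative realignment is not shown.

```python
def indel_coverage(ref_aln, other_aln, col_start, col_end, gap="-"):
    """Fraction of the reference domain affected by indels (assumed definition).

    ref_aln, other_aln: two rows of the same alignment (equal length)
    col_start, col_end: domain region in alignment columns (0-based, end-exclusive)
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned rows must have the same length")
    ref_region = ref_aln[col_start:col_end]
    other_region = other_aln[col_start:col_end]
    domain_len = sum(1 for c in ref_region if c != gap)
    if domain_len == 0:
        raise ValueError("no reference residues in the domain region")
    # A column counts as an indel when exactly one of the two rows has a gap.
    indel_cols = sum(
        1 for r, o in zip(ref_region, other_region) if (r == gap) != (o == gap)
    )
    return indel_cols / domain_len


def keep_aligned_domains(ref_aln, others, col_start, col_end, max_coverage=0.10):
    """Drop aligned sequences whose indel coverage exceeds max_coverage."""
    return {
        name: row
        for name, row in others.items()
        if indel_coverage(ref_aln, row, col_start, col_end) <= max_coverage
    }


# Toy alignment rows (made-up sequences).
reference = "ACDEFGHIKL"
orthologs = {"o1": "ACDEFGHIKL", "o2": "AC--FGHIKL", "o3": "ACDEFGHIK-"}
print(sorted(keep_aligned_domains(reference, orthologs, 0, 10)))
```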
Subsequently, we scanned a representative set of the human proteome (20,350 "reviewed" sequences from UniProtKB) with the domain-specific HMMs using the hmmsearch program. We used an E-value cutoff of 0.01 to select the best hits and rejected hits shorter than 90% of the annotated motif-binding domain sequence length. Finally, the E-value of the best-scoring domain was converted into a domain similarity score using the iELM script downloaded from http://elmint.embl.de/program_file/.
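The two hit-selection criteria above (E-value cutoff and minimum length) can be expressed as a small Python filter. This is a sketch under stated assumptions: the `DomainHit` record and its field names are hypothetical and do not correspond to the hmmsearch output format, the E-value cutoff is treated as inclusive, and the conversion of E-values into domain similarity scores by the iELM script is not reproduced.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    protein: str      # identifier of the scanned human protein
    evalue: float     # E-value reported for the domain hit
    hit_length: int   # number of residues covered by the hit


def filter_hits(hits, domain_length, max_evalue=0.01, min_length_percent=90):
    """Keep hits passing the E-value cutoff and the minimum length criterion.

    domain_length: length of the annotated motif-binding domain sequence.
    The length test uses integer arithmetic to avoid floating-point edge cases.
    """
    return [
        h
        for h in hits
        if h.evalue <= max_evalue
        and h.hit_length * 100 >= min_length_percent * domain_length
    ]


# Made-up hits against a hypothetical 100-residue annotated domain.
toy_hits = [
    DomainHit("P1", 1e-5, 95),
    DomainHit("P2", 0.5, 100),
    DomainHit("P3", 1e-3, 80),
    DomainHit("P4", 0.01, 90),
]
print([h.protein for h in filter_hits(toy_hits, domain_length=100)])
```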

You can find more technical details on the mimicINT GitHub repository: https://github.com/TAGC-NetworkBiology/mimicINT

And in the mimicINT paper: https://doi.org/10.1101/2022.11.04.515250

Tools and resources used by mimicINT

3did: https://3did.irbbarcelona.org/
ELM database: http://elm.eu.org/index.html
gProfiler2 R client: https://biit.cs.ut.ee/gprofiler/page/r
InterProScan: https://www.ebi.ac.uk/interpro/search/sequence/
IUPred: https://iupred2a.elte.hu/
SLiMSuite: https://github.com/slimsuite

Relevant references

Blum,M. et al. (2021) The InterPro protein families and domains database: 20 years on. Nucleic Acids Res, 49, D344–D354. https://doi.org/10.1093/nar/gkaa977

Davey,N.E. et al. (2007) The SLiMDisc server: short, linear motif discovery in proteins. Nucleic Acids Res, 35, W455–W459. https://doi.org/10.1093/nar/gkm400

Dosztányi,Z. (2018) Prediction of protein disorder based on IUPred. Protein Sci, 27, 331–340. https://doi.org/10.1002/pro.3334

Edwards,R.J. et al. (2020) Computational Prediction of Disordered Protein Motifs Using SLiMSuite. Methods Mol Biol, 2141, 37–72. https://doi.org/10.1007/978-1-0716-0524-0_3

Edwards,R.J. and Palopoli,N. (2015) Computational prediction of short linear motifs from protein sequences. Methods Mol Biol, 1268, 89–141. https://doi.org/10.1007/978-1-4939-2285-7_6

Jones,P. et al. (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics, 30, 1236–1240. https://doi.org/10.1093/bioinformatics/btu031

Kumar,M. et al. (2020) ELM-the eukaryotic linear motif resource in 2020. Nucleic Acids Res, 48, D296–D306. https://doi.org/10.1093/nar/gkz1030

Mosca,R. et al. (2014) 3did: a catalog of domain-based interactions of known three-dimensional structure. Nucleic Acids Res, 42, D374–D379. https://doi.org/10.1093/nar/gkt887

Paulsen,K. (2019) Optimising intrinsic disorder prediction for short linear motif discovery. PhD thesis. https://doi.org/10.26190/unsworks/21456

Weatheritt,R.J. et al. (2012) The identification of short linear motif-mediated interfaces within the human interactome. Bioinformatics, 28, 976–982. https://doi.org/10.1093/bioinformatics/bts072