Data Repeatability Indices DAIRI & IDAIRI

Proprietary metrics for evaluating the uniqueness and similarity of peptide databases

Optimization for Real-World Results

The initial formulas for the DAIRI and IDAIRI indices, described in the original publication, were based on the total number of comparisons generated by the DIAMOND aligner. This method, while innovative, proved to be highly dependent on the size of the databases being compared, which made it difficult to compare results objectively across databases of different sizes.

In response to these challenges, the index formulas have been thoroughly optimized. The new approach no longer analyzes the number of DIAMOND comparisons (which is why the final "d" was removed from the index names) and instead evaluates the sequence composition of the databases directly. In keeping with the principle of absolute-identity repeatability, the optimized formulas count perfect duplicates (100% identity) within a database and unique sequences shared between the compared databases, which makes it possible to obtain significantly more reliable and useful results.

DAIRI Index

Database Absolute-Identity Repeatability Index

Internal Absolute Repeatability is an index that determines what proportion of a database consists of perfect duplicates.

DAIRI = N_duplicates / (N_total - 1)

N_duplicates – the number of duplicates (100% identical sequences) within the same database.

N_total – the total number of sequences in the analyzed database.
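
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the formula is read as N_duplicates / (N_total - 1); N_duplicates is taken to be the number of redundant copies, that is, N_total minus the number of distinct sequences; sequences are compared as exact strings; and a database with fewer than two sequences is given a value of 0.0. The function name `dairi` and the toy sequences are hypothetical.

```python
def dairi(sequences):
    """Sketch of DAIRI under the assumptions stated in the text above.

    sequences: list of peptide sequences (strings) from one database.
    """
    n_total = len(sequences)
    if n_total < 2:
        # Assumption: with fewer than two entries nothing can repeat.
        return 0.0
    # Assumption: a "duplicate" is every copy beyond the first occurrence.
    n_duplicates = n_total - len(set(sequences))
    return n_duplicates / (n_total - 1)


# Toy database (made-up sequences): 5 entries, 3 distinct, 2 redundant copies.
toy_db = ["GIGK", "GIGK", "KWKL", "GIGK", "ILPW"]
print(dairi(toy_db))  # 2 / (5 - 1) = 0.5
```

Under this reading the index runs from 0 (no sequence repeats) to 1 (every entry is a copy of the same sequence), which is consistent with the interpretation of high and low values given below.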

High DAIRI Value (e.g., 0.800)

Means high repeatability. As many as 80% of the sequences in the database are perfect duplicates of other entries. From the perspective of searching for unique peptides, this is an undesirable result.

Low DAIRI Value (e.g., 0.100)

Means low repeatability and high uniqueness. Only 10% of the sequences are duplicates. This is a desirable result, indicating great diversity in the database.

IDAIRI Index

Inter-Database Absolute-Identity Repeatability Index

Inter-Database Absolute Repeatability is an index that determines to what extent two databases overlap in terms of perfect duplicates.

IDAIRI(A → B) = N_{A_unique ∩ B_unique} / |A_unique|

A_unique and B_unique – sets of unique sequences from databases A and B.

N_{A_unique ∩ B_unique} – the number of unique sequences that are common to both databases.

|A_unique| – the total number of unique sequences in query database A.
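
As a rough sketch of the formula above, IDAIRI can be computed with set operations. The function name `idairi`, the exact-string comparison, the handling of an empty query database (returning 0.0), and the toy sequences are all assumptions made for illustration; this is not the authors' code.

```python
def idairi(query, reference):
    """Sketch of IDAIRI(A -> B): share of A's unique sequences also found in B.

    query: list of sequences from database A.
    reference: list of sequences from database B.
    """
    a_unique = set(query)      # A_unique
    b_unique = set(reference)  # B_unique
    if not a_unique:
        # Assumption: an empty query database yields 0.0.
        return 0.0
    return len(a_unique & b_unique) / len(a_unique)


# Toy databases (made-up sequences).
db_a = ["GIGK", "GIGK", "KWKL", "ILPW", "RRWQ"]  # 4 unique sequences
db_b = ["KWKL", "ILPW", "FLPL"]                  # 3 unique sequences

print(idairi(db_a, db_b))  # 2 shared / 4 unique in A = 0.5
print(idairi(db_b, db_a))  # 2 shared / 3 unique in B
```

The two calls illustrate that the index is directional: the numerator is the same in both directions, but the denominator is always the number of unique sequences in the query database.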

High IDAIRI Value (e.g., 0.950)

Means high coverage. As many as 95% of the unique peptides from database A have their perfect counterparts in database B. Database A contributes little new information relative to database B. This is an undesirable result.

Low IDAIRI Value (e.g., 0.050)

Means low coverage. Only 5% of the unique peptides from database A have perfect counterparts in database B, so database A consists mainly of peptides that are not found in database B. This is a desirable result, indicating high novelty and informational value of database A.

Bibliography

Marczak, B., Bocian, A., & Łyskowski, A. (2025). Antimicrobial Peptide Databases as the Guiding Resource in New Antimicrobial Agent Identification via Computational Methods. Molecules, 30, 1318. https://doi.org/10.3390/molecules30061318