The following is the set of queries used in this analysis.
Queries used for calculating pleiotropy rates
The queries were run using SCAIView version 1.7.3, which corresponds to the indexing of MEDLINE on 2016-07-14T13:50:07.797575Z.
*Note that the reference queries might take some time to run, since thousands of articles need to be analyzed.
|Disease||Reference Query||Number of documents||Genes associated with the disease||Gene set size||Normalized pleiotropy rate (%)|
Description of each column:
Column 1. Disease.
Column 2. Reference query for the disease.
Column 3. Number of documents retrieved using the disease reference query.
Column 4. Total number of genes found in the corpus retrieved with the reference query for the disease.
Column 5. Number of genes with a relative entropy greater than 0 retrieved from a query combining the disease of interest and epilepsy. For example, the query for diabetes is MeSH Disease:"Epilepsy" AND MeSH Disease:"Diabetes Mellitus", and the resulting corpus contains articles that mention both epilepsy and diabetes. The relative entropy is calculated from the occurrence of genes/proteins within this corpus compared with their occurrence in MEDLINE.
Column 6. Normalized pleiotropy rate. Overlap of genes with the Epilepsy gene set (2901 genes in total), which contains the genes with a relative entropy greater than 0 for the Epilepsy reference query MeSH Disease:"Epilepsy" (192245 documents).
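The text does not spell out how the overlap in Column 6 is normalized. The sketch below is one possible reading, not necessarily the calculation used here: it assumes the rate is the number of shared genes divided by the size of the Epilepsy gene set, expressed as a percentage. The function name and the gene symbols in the usage lines are hypothetical placeholders.

```python
def normalized_pleiotropy_rate(disease_genes, epilepsy_genes):
    """Percentage of the Epilepsy gene set that also appears in the disease gene set.

    ASSUMPTION: normalization by the size of the Epilepsy gene set;
    the source only says "overlap of genes in comparison with the Epilepsy geneset".
    """
    epilepsy_genes = set(epilepsy_genes)
    if not epilepsy_genes:
        raise ValueError("the Epilepsy gene set must not be empty")
    shared = set(disease_genes) & epilepsy_genes
    return 100.0 * len(shared) / len(epilepsy_genes)


# Made-up placeholder symbols, only to show the call.
example_epilepsy = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
example_disease = {"GENE_B", "GENE_C", "GENE_X"}
example_rate = normalized_pleiotropy_rate(example_disease, example_epilepsy)
```

If the rate was instead normalized by the size of the disease gene set, only the denominator would change.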
First column structure: Common Name;Internal Identifier;Relative Entropy;Reference Entity Count;Entity Count;Query Entity Count;
Only entries with HGNC names and a relative entropy greater than 0 are extracted.
There seems to be a problem with the structure of the exported CSV file, because pandas is not able to import it.
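One way to work around the import problem is to split each exported line by hand on the semicolon and filter the entries directly. The following is a minimal sketch under two assumptions that are not confirmed by the export itself: HGNC entries can be recognized by the string "HGNC" in the Internal Identifier field, and the Relative Entropy field parses as a plain decimal number. The function name and the sample rows are hypothetical placeholders, not real export data.

```python
def extract_hgnc_genes(lines, id_marker="HGNC"):
    """Return {common name: relative entropy} for HGNC entries with entropy > 0.

    Expected field order per line (from the export description):
    Common Name;Internal Identifier;Relative Entropy;Reference Entity Count;
    Entity Count;Query Entity Count;
    """
    genes = {}
    for raw_line in lines:
        fields = raw_line.strip().split(";")
        if len(fields) < 3:
            continue  # blank or truncated line
        name, identifier, entropy_text = fields[0], fields[1], fields[2]
        try:
            entropy = float(entropy_text)
        except ValueError:
            continue  # header row or malformed value
        # ASSUMPTION: HGNC entries carry "HGNC" in their identifier.
        if id_marker in identifier and entropy > 0:
            genes[name] = entropy
    return genes


# Made-up rows that only mimic the documented column layout.
sample_lines = [
    "Common Name;Internal Identifier;Relative Entropy;Reference Entity Count;Entity Count;Query Entity Count;",
    "GENE_A;HGNC:0001;0.52;1000;40;40;",
    "GENE_B;HGNC:0002;-0.10;5000;3;3;",
    "protein_x;OTHER:77;0.90;200;12;12;",
]
sample_genes = extract_hgnc_genes(sample_lines)
```

With a real export, the lines could be read with `open(path, encoding="utf-8")` and passed to the same function; the header row is skipped because its third field is not numeric.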
An explanation of how relative entropies are calculated can be found at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3541249/
Relative entropy (p1, p2) = p1 * log (p1/p2)
Where p1 is the number of abstracts containing the entity in the query-selected corpus and p2 denotes the total number of documents in which the entity occurs within an unspecific reference corpus (i.e. the entire MEDLINE). The Kullback–Leibler divergence gives a high rank to entities whose frequency in the selected corpus is especially high compared with the unspecific reference corpus. This means that entities that simply occur frequently do not automatically receive high ranks. For example, using the query “ ‘Alzheimer’s Disease’ AND ‘Evidence marker’ AND ‘Human Genes/Proteins’ ”, we retrieved 331 abstracts containing IL1B, which gives it a frequency rank of 10. Conversely, according to the relative entropy formula, IL1B has an entropy rank of 34 despite its high occurrence in MEDLINE (i.e. 40685 abstracts).
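The formula can be sketched in a few lines of Python. This sketch assumes that p1 and p2 are first turned into relative frequencies (document count divided by corpus size), which the text above does not state explicitly; with raw counts the ratio p1/p2 could never exceed 1, so no entity would reach a relative entropy greater than 0. The function name, the parameter names and the numbers in the usage lines are hypothetical.

```python
import math


def relative_entropy(query_count, query_corpus_size,
                     reference_count, reference_corpus_size):
    """Compute p1 * log(p1 / p2) for one entity.

    ASSUMPTION: p1 and p2 are relative frequencies, i.e. the number of
    documents mentioning the entity divided by the size of each corpus.
    """
    if query_count == 0 or reference_count == 0:
        return 0.0  # entity absent from one corpus: no contribution
    p1 = query_count / query_corpus_size
    p2 = reference_count / reference_corpus_size
    return p1 * math.log(p1 / p2)


# Made-up counts: an entity that is five times more frequent in the
# query corpus than in the reference corpus scores above 0 ...
enriched = relative_entropy(50, 1000, 10000, 1000000)
# ... while one that is rarer in the query corpus scores below 0.
depleted = relative_entropy(5, 1000, 10000, 1000000)
```

Under this reading, keeping only genes with a relative entropy greater than 0 amounts to keeping genes that are mentioned proportionally more often in the query corpus than in the reference corpus.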