Documentation for genie3 module

bengrn.tools.genie3

Functions:

Name            Description
GENIE3          GENIE3 Computation of tree-based scores for all putative regulatory links.
get_link_list   Gets the ranked list of (directed) regulatory links.

GENIE3

GENIE3 Computation of tree-based scores for all putative regulatory links.

Parameters:
  • expr_data (ndarray) –

    Array containing gene expression values. Each row corresponds to a condition and each column corresponds to a gene.

  • gene_names (list[str], default: None ) –

    List of length p, where p is the number of columns in expr_data, containing the names of the genes. The i-th item of gene_names must correspond to the i-th column of expr_data. Defaults to None.

  • regulators (list[str], default: 'all' ) –

    List containing the names of the candidate regulators. When a list of regulators is provided, the names of all the genes must be provided (in gene_names). When regulators is set to 'all', any gene can be a candidate regulator. Defaults to 'all'.

  • tree_method (str, default: 'RF' ) –

    Specifies which tree-based procedure is used: either Random Forest ('RF') or Extra-Trees ('ET'). Defaults to 'RF'.

  • K (str or int, default: 'sqrt' ) –

    Specifies the number of selected attributes at each node of one tree: either the square root of the number of candidate regulators ('sqrt'), the total number of candidate regulators ('all'), or any positive integer. Defaults to 'sqrt'.

  • ntrees (int, default: 1000 ) –

    Specifies the number of trees grown in an ensemble. Defaults to 1000.

  • nthreads (int, default: 1 ) –

    Number of threads used for parallel computing. Defaults to 1.

Returns:
  • numpy.ndarray: An array in which the element (i,j) is the score of the edge directed from the i-th gene to the j-th gene. All diagonal elements are set to zero (auto-regulations are not considered). When a list of candidate regulators is provided, the scores of all the edges directed from a gene that is not a candidate regulator are set to zero.
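
To make the orientation of the returned array concrete, the sketch below uses a small hand-written score matrix and a hypothetical helper, top_regulators, that is not part of bengrn. The values are invented for illustration and are not GENIE3 output. Element (i, j) is read as the score of the edge from gene i to gene j, so the candidate regulators of a target are found in that target's column.

```python
# Hypothetical 3-gene score matrix; values are made up for illustration.
# scores[i][j] is the score of the edge directed from gene i to gene j.
gene_names = ["A", "B", "C"]
scores = [
    [0.0, 0.7, 0.2],  # edges A->A (zero diagonal), A->B, A->C
    [0.1, 0.0, 0.8],  # edges B->A, B->B (zero diagonal), B->C
    [0.0, 0.0, 0.0],  # C is not a candidate regulator: its row is all zeros
]


def top_regulators(scores, gene_names, target, n=2):
    """Return the n highest-scoring (regulator, score) pairs for one target."""
    j = gene_names.index(target)
    # Read column j: one score per potential regulator, skipping the diagonal.
    column = [(gene_names[i], row[j]) for i, row in enumerate(scores) if i != j]
    return sorted(column, key=lambda pair: pair[1], reverse=True)[:n]


print(top_regulators(scores, gene_names, "C"))
```

Reading a row instead of a column would list the targets of a regulator rather than the regulators of a target.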

Source code in bengrn/tools/genie3.py
def GENIE3(
    expr_data: ndarray,
    gene_names: Optional[list[str]] = None,
    regulators: Union[list[str], str] = "all",
    tree_method: str = "RF",
    K: Union[str, int] = "sqrt",
    ntrees: int = 1000,
    nthreads: int = 1,
):
    """
    GENIE3 Computation of tree-based scores for all putative regulatory links.

    Args:
        expr_data (numpy.ndarray): Array containing gene expression values. Each row corresponds to a condition and each column corresponds to a gene.
        gene_names (list[str], optional): List of length p, where p is the number of columns in expr_data, containing the names of the genes. The i-th item of gene_names must correspond to the i-th column of expr_data. Defaults to None.
        regulators (list[str], optional): List containing the names of the candidate regulators. When a list of regulators is provided, the names of all the genes must be provided (in gene_names). When regulators is set to 'all', any gene can be a candidate regulator. Defaults to 'all'.
        tree_method (str, optional): Specifies which tree-based procedure is used: either Random Forest ('RF') or Extra-Trees ('ET'). Defaults to 'RF'.
        K (str or int, optional): Specifies the number of selected attributes at each node of one tree: either the square root of the number of candidate regulators ('sqrt'), the total number of candidate regulators ('all'), or any positive integer. Defaults to 'sqrt'.
        ntrees (int, optional): Specifies the number of trees grown in an ensemble. Defaults to 1000.
        nthreads (int, optional): Number of threads used for parallel computing. Defaults to 1.

    Returns:
        numpy.ndarray: An array in which the element (i,j) is the score of the edge directed from the i-th gene to the j-th gene. All diagonal elements are set to zero (auto-regulations are not considered). When a list of candidate regulators is provided, the scores of all the edges directed from a gene that is not a candidate regulator are set to zero.
    """
    time_start = time.time()

    # Check input arguments
    if not isinstance(expr_data, ndarray):
        raise ValueError(
            "expr_data must be an array in which each row corresponds to a condition/sample and each column corresponds to a gene"
        )

    ngenes = expr_data.shape[1]

    if gene_names is not None:
        if not isinstance(gene_names, (list, tuple)):
            raise ValueError("input argument gene_names must be a list of gene names")
        elif len(gene_names) != ngenes:
            raise ValueError(
                "input argument gene_names must be a list of length p, where p is the number of columns/genes in the expr_data"
            )

    if regulators != "all":
        if not isinstance(regulators, (list, tuple)):
            raise ValueError("input argument regulators must be a list of gene names")

        if gene_names is None:
            raise ValueError(
                "the gene names must be specified (in input argument gene_names)"
            )
        else:
            sIntersection = set(gene_names).intersection(set(regulators))
            if not sIntersection:
                raise ValueError(
                    "the genes must contain at least one candidate regulator"
                )

    if tree_method != "RF" and tree_method != "ET":
        raise ValueError(
            'input argument tree_method must be "RF" (Random Forests) or "ET" (Extra-Trees)'
        )

    if K != "sqrt" and K != "all" and not isinstance(K, int):
        raise ValueError(
            'input argument K must be "sqrt", "all" or a strictly positive integer'
        )

    if isinstance(K, int) and K <= 0:
        raise ValueError(
            'input argument K must be "sqrt", "all" or a strictly positive integer'
        )

    if not isinstance(ntrees, int):
        raise ValueError("input argument ntrees must be a strictly positive integer")
    elif ntrees <= 0:
        raise ValueError("input argument ntrees must be a strictly positive integer")

    if not isinstance(nthreads, int):
        raise ValueError("input argument nthreads must be a strictly positive integer")
    elif nthreads <= 0:
        raise ValueError("input argument nthreads must be a strictly positive integer")

    print("Tree method: " + str(tree_method))
    print("K: " + str(K))
    print("Number of trees: " + str(ntrees))
    print("\n")

    # Get the indices of the candidate regulators
    if regulators == "all":
        input_idx = list(range(ngenes))
    else:
        input_idx = [i for i, gene in enumerate(gene_names) if gene in regulators]

    # Learn an ensemble of trees for each target gene, and compute scores for candidate regulators
    VIM = zeros((ngenes, ngenes))

    if nthreads > 1:
        print("running jobs on %d threads" % nthreads)

        input_data = list()
        for i in range(ngenes):
            input_data.append([expr_data, i, input_idx, tree_method, K, ntrees])

        pool = Pool(nthreads)
        alloutput = list(
            tqdm.tqdm(pool.imap(wr_GENIE3_single, input_data), total=ngenes)
        )

        for i, vi in alloutput:
            VIM[i, :] = vi

    else:
        print("running single threaded jobs")
        for i in range(ngenes):
            print("Gene %d/%d..." % (i + 1, ngenes))

            vi = GENIE3_single(expr_data, i, input_idx, tree_method, K, ntrees)
            VIM[i, :] = vi

    VIM = transpose(VIM)

    time_end = time.time()
    print("Elapsed time: %.2f seconds" % (time_end - time_start))

    return VIM
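
Two bookkeeping steps in the listing above, selecting the candidate-regulator columns and transposing the stacked per-target importance vectors, can be sketched without any tree learner. In the sketch below, regulator_indices and stack_and_transpose are hypothetical names, and the importance vectors are invented numbers standing in for whatever GENIE3_single returns (that helper is not shown in this section).

```python
def regulator_indices(gene_names, regulators="all"):
    """Column indices of the candidate regulators, in gene order."""
    if regulators == "all":
        return list(range(len(gene_names)))
    wanted = set(regulators)  # set lookup avoids rescanning the list per gene
    return [i for i, gene in enumerate(gene_names) if gene in wanted]


def stack_and_transpose(per_target_importances):
    """Turn rows indexed by target into a matrix indexed [regulator][target]."""
    return [list(column) for column in zip(*per_target_importances)]


gene_names = ["A", "B", "C"]
print(regulator_indices(gene_names, ["C", "A"]))

# Row t holds made-up importances of every gene for predicting target t.
per_target = [
    [0.0, 0.1, 0.0],  # target A
    [0.7, 0.0, 0.0],  # target B
    [0.2, 0.8, 0.0],  # target C
]
vim = stack_and_transpose(per_target)
print(vim[1][2])  # score of the edge B -> C
```

The transpose is what makes the final array read as "from row gene to column gene", since each learned vector is naturally stored per target.
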

get_link_list

Gets the ranked list of (directed) regulatory links.

Parameters:
  • VIM (ndarray) –

    Array as returned by the function GENIE3(), in which the element (i,j) is the score of the edge directed from the i-th gene to the j-th gene.

  • gene_names (list[str], default: None) –

    List of length p, where p is the number of rows/columns in VIM, containing the names of the genes. The i-th item of gene_names must correspond to the i-th row/column of VIM. When the gene names are not provided, the i-th gene is named Gi. Defaults to None.

  • regulators (list[str], default: 'all') –

    List containing the names of the candidate regulators. When a list of regulators is provided, the names of all the genes must be provided (in gene_names), and the returned list contains only edges directed from the candidate regulators. When regulators is set to 'all', any gene can be a candidate regulator. Defaults to 'all'.

  • maxcount (str or int, default: 'all') –

    Writes only the first maxcount regulatory links of the ranked list. When maxcount is set to 'all', all the regulatory links are written. Defaults to 'all'.

  • file_name (str, default: None) –

    Writes the ranked list of regulatory links to the file file_name. Defaults to None.

Returns:
  • pd.DataFrame: The list of regulatory links, ordered by decreasing edge score. Auto-regulations do not appear in the list. Regulatory links with a score equal to zero are randomly permuted. Each line of the ranked list has the tab-separated format: regulator, target gene, score of edge. Note that the source listing below contains no return statement: the list is written to file_name when one is given and printed to standard output otherwise.
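
The ranking rule described above (drop self-loops, keep only edges leaving candidate regulators, sort by decreasing score, shuffle the zero-score tail) can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: rank_links is a hypothetical name, it returns a list instead of printing, the seed parameter is an addition for reproducibility, and scores are assumed to be non-negative.

```python
import random


def rank_links(vim, gene_names, regulators="all", maxcount="all", seed=None):
    """Rank directed edges by score; zero-score edges are shuffled at the end."""
    allowed = set(gene_names) if regulators == "all" else set(regulators)
    links = [
        (gene_names[i], gene_names[j], score)
        for i, row in enumerate(vim)
        for j, score in enumerate(row)
        if i != j and gene_names[i] in allowed  # no self-loops, regulators only
    ]
    links.sort(key=lambda link: link[2], reverse=True)

    # Scores are assumed non-negative, so all zeros form the tail of the list.
    first_zero = next(
        (k for k, link in enumerate(links) if link[2] == 0), len(links)
    )
    tail = links[first_zero:]
    random.Random(seed).shuffle(tail)  # seed is an addition for reproducibility
    links[first_zero:] = tail

    # maxcount is assumed to be "all" or a non-negative integer.
    return links if maxcount == "all" else links[:maxcount]


vim = [
    [0.0, 0.7, 0.2],
    [0.1, 0.0, 0.8],
    [0.0, 0.0, 0.0],
]
for regulator, target, score in rank_links(vim, ["A", "B", "C"], maxcount=3, seed=0):
    print(f"{regulator}\t{target}\t{score:.6f}")
```

Shuffling the zero-score tail avoids suggesting an order among edges that carry no evidence, which matters when only the first maxcount links are kept.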

Source code in bengrn/tools/genie3.py
def get_link_list(
    VIM: ndarray,
    gene_names: Optional[list[str]] = None,
    regulators: Union[list[str], str] = "all",
    maxcount: Union[int, str] = "all",
    file_name: Optional[str] = None,
):
    """Gets the ranked list of (directed) regulatory links.

    Args:
        VIM (np.array): Array as returned by the function GENIE3(), in which the element (i,j) is the score of the edge directed from the i-th gene to the j-th gene.
        gene_names (list[str], optional): List of length p, where p is the number of rows/columns in VIM, containing the names of the genes. The i-th item of gene_names must correspond to the i-th row/column of VIM. When the gene names are not provided, the i-th gene is named Gi. Default is None.
        regulators (list[str], optional): List containing the names of the candidate regulators. When a list of regulators is provided, the names of all the genes must be provided (in gene_names), and the returned list contains only edges directed from the candidate regulators. When regulators is set to 'all', any gene can be a candidate regulator. Default is 'all'.
        maxcount (Union[str, int], optional): Writes only the first maxcount regulatory links of the ranked list. When maxcount is set to 'all', all the regulatory links are written. Default is 'all'.
        file_name (str, optional): Writes the ranked list of regulatory links to the file file_name. Default is None.

    Returns:
        pd.DataFrame: The list of regulatory links, ordered according to the edge score. Auto-regulations do not appear in the list. Regulatory links with a score equal to zero are randomly permuted. In the ranked list of edges, each line has format:
            regulator   target gene     score of edge
    """

    # Check input arguments
    if not isinstance(VIM, ndarray):
        raise ValueError("VIM must be a square array")
    elif VIM.shape[0] != VIM.shape[1]:
        raise ValueError("VIM must be a square array")

    ngenes = VIM.shape[0]

    if gene_names is not None:
        if not isinstance(gene_names, (list, tuple)):
            raise ValueError("input argument gene_names must be a list of gene names")
        elif len(gene_names) != ngenes:
            raise ValueError(
                "input argument gene_names must be a list of length p, where p is the number of columns/genes in the expression data"
            )

    if regulators != "all":
        if not isinstance(regulators, (list, tuple)):
            raise ValueError("input argument regulators must be a list of gene names")

        if gene_names is None:
            raise ValueError(
                "the gene names must be specified (in input argument gene_names)"
            )
        else:
            sIntersection = set(gene_names).intersection(set(regulators))
            if not sIntersection:
                raise ValueError(
                    "The genes must contain at least one candidate regulator"
                )

    if maxcount != "all" and not isinstance(maxcount, int):
        raise ValueError('input argument maxcount must be "all" or a positive integer')

    if file_name is not None and not isinstance(file_name, str):
        raise ValueError("input argument file_name must be a string")

    # Get the indices of the candidate regulators
    if regulators == "all":
        input_idx = range(ngenes)
    else:
        input_idx = [i for i, gene in enumerate(gene_names) if gene in regulators]

    # Get the non-ranked list of regulatory links
    vInter = [
        (i, j, score) for (i, j), score in ndenumerate(VIM) if i in input_idx and i != j
    ]

    # Rank the list according to the weights of the edges
    vInter_sort = sorted(vInter, key=itemgetter(2), reverse=True)
    nInter = len(vInter_sort)

    # Random permutation of edges with score equal to 0
    flag = 1
    i = 0
    while flag and i < nInter:
        (TF_idx, target_idx, score) = vInter_sort[i]
        if score == 0:
            flag = 0
        else:
            i += 1

    if not flag:
        items_perm = vInter_sort[i:]
        items_perm = random.permutation(items_perm)
        vInter_sort[i:] = items_perm

    # Write the ranked list of edges
    nToWrite = nInter
    if isinstance(maxcount, int) and maxcount >= 0 and maxcount < nInter:
        nToWrite = maxcount

    if file_name:
        outfile = open(file_name, "w")

        if gene_names is not None:
            for i in range(nToWrite):
                (TF_idx, target_idx, score) = vInter_sort[i]
                TF_idx = int(TF_idx)
                target_idx = int(target_idx)
                outfile.write(
                    "%s\t%s\t%.6f\n"
                    % (gene_names[TF_idx], gene_names[target_idx], score)
                )
        else:
            for i in range(nToWrite):
                (TF_idx, target_idx, score) = vInter_sort[i]
                TF_idx = int(TF_idx)
                target_idx = int(target_idx)
                outfile.write("G%d\tG%d\t%.6f\n" % (TF_idx + 1, target_idx + 1, score))

        outfile.close()

    else:
        if gene_names is not None:
            for i in range(nToWrite):
                (TF_idx, target_idx, score) = vInter_sort[i]
                TF_idx = int(TF_idx)
                target_idx = int(target_idx)
                print(
                    "%s\t%s\t%.6f" % (gene_names[TF_idx], gene_names[target_idx], score)
                )
        else:
            for i in range(nToWrite):
                (TF_idx, target_idx, score) = vInter_sort[i]
                TF_idx = int(TF_idx)
                target_idx = int(target_idx)
                print("G%d\tG%d\t%.6f" % (TF_idx + 1, target_idx + 1, score))
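
Because the listing above writes one headerless, tab-separated line per edge, a file produced with file_name can be read back with the standard csv module. The sketch below is a minimal example under that assumption. read_link_list is a hypothetical helper, and the two edges written to the temporary file are invented stand-ins for real output.

```python
import csv
import os
import tempfile


def read_link_list(path):
    """Parse a headerless, tab-separated link file into (regulator, target, score)."""
    with open(path, newline="") as handle:
        return [
            (regulator, target, float(score))
            for regulator, target, score in csv.reader(handle, delimiter="\t")
        ]


# Stand-in for a file written by get_link_list; the two edges are invented.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.tsv")
    with open(path, "w") as handle:
        handle.write("%s\t%s\t%.6f\n" % ("B", "C", 0.8))
        handle.write("%s\t%s\t%.6f\n" % ("A", "B", 0.7))
    links = read_link_list(path)

print(links)
```

Scores are stored with six decimal places, so values read back from the file are rounded relative to the in-memory matrix.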