Title: | Tree-Based Scan Statistics |
---|---|
Description: | Implementation of unconditional Bernoulli Scan Statistic developed by Kulldorff et al. (2003) <doi:10.1111/1541-0420.00039> for hierarchical tree structures. Tree-based Scan Statistics are an exploratory method to identify event clusters across the space of a hierarchical tree. |
Authors: | Joshua P. Entrop [aut, cre, cph] |
Maintainer: | Joshua P. Entrop <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.2 |
Built: | 2025-03-11 05:47:22 UTC |
Source: | https://github.com/entjos/treeminer |
A dataset including the following column:
pathString
A string identifying all the parents of a node. Each parent is separated by a /.
data(atc_codes)
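To make the pathString format concrete, the following base R sketch builds a tiny tree by hand and splits each path into its levels. The three paths, including the root label "ATC", are made up for illustration and are not taken from the atc_codes dataset.

```r
# Hypothetical paths in the pathString format: parents separated by "/".
tree <- data.frame(
  pathString = c("ATC/A/A01/A01A", "ATC/A/A02", "ATC/B/B01/B01A"),
  stringsAsFactors = FALSE
)

# Split each path into its levels; the last element is the leaf itself.
levels_per_path <- strsplit(tree$pathString, "/", fixed = TRUE)
leaves <- vapply(levels_per_path, function(x) x[length(x)], character(1))

# Number of levels on each path, counting the root.
depth <- lengths(levels_per_path)
```

Every element before the last one in a split path is an ancestor of the leaf, which is what allows events recorded at a leaf to be counted towards each higher-level cut.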
Creating a tree file for further use in TreeMineR().
create_tree(x)
x |
A data frame that includes two or three columns:
|
A data.frame with one variable pathString that describes the full path for each leaf included in the hierarchical tree.
A simulated dataset of hospital diagnoses created with the help of the comorbidity package, including the following columns:
Individual identifier,
Indicator for case status,
An ICD-10 diagnosis code.
data(diagnoses)
A data frame with 23,144 rows and 3 columns
Remove cuts from your tree. This is useful, for example, if you would like to remove certain chapters from the ICD-10 tree used for the analysis because they are deemed a priori irrelevant for the exposure of interest. For instance, chapter 20 (external causes of death) might not be of interest when comparing two drug exposures.
drop_cuts(tree, cuts, delimiter = "/", return_removed = FALSE)
tree |
A dataset with one variable |
cuts |
A character vector of cuts to remove. Please make sure that your string
uniquely identifies the cut that should be removed. Each string is passed
to Regular expression are composed as follows:
|
delimiter |
A character defining the delimiter of different tree levels within your
|
return_removed |
A logical value for indicating whether you would like to get a list of removed cuts returned by the function. |
If return_removed = FALSE, a data.frame with a single variable named pathString is returned, which includes the updated tree. If return_removed = TRUE, a list with two elements is returned:
The updated tree file
A list of character vectors including the paths that have been removed from the supplied tree. The list is named using the cuts supplied to cuts.
drop_cuts(icd_10_se, c("B35-B49", "F41")) |> head()
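The idea of removing paths by pattern, and of reporting what was removed, can be sketched in base R as follows. This is an illustration of the concept under stated assumptions, not the code used inside drop_cuts(): the four paths are made up, and matching with grepl() is an assumption of this sketch.

```r
# Hypothetical tree in the pathString format.
tree <- data.frame(
  pathString = c("ICD/A00-B99/B35-B49/B35",
                 "ICD/A00-B99/B50-B64/B50",
                 "ICD/F00-F99/F40-F48/F41",
                 "ICD/F00-F99/F40-F48/F40"),
  stringsAsFactors = FALSE
)
cuts <- c("B35-B49", "F41")

# For each cut, flag the paths whose string matches the pattern.
matches <- lapply(setNames(cuts, cuts),
                  function(cut) grepl(cut, tree$pathString))

# A path is dropped if it matches at least one cut.
to_drop <- Reduce(`|`, matches)
updated <- tree[!to_drop, , drop = FALSE]

# Removed paths, as a list named by the cut that matched them.
removed <- lapply(matches, function(m) tree$pathString[m])
```

The sketch also shows why each string should identify its cut uniquely: a pattern such as "F4" would match F40, F41, and the block F40-F48, and would therefore remove more paths than intended.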
A dataset including the following column:
pathString
A string identifying all the parents of a node. Each parent is separated by a /.
data(icd_10_se)
A dataset including the following column:
node
A string identifying a node
title
A label for the node
data(icd_10_se_dict)
Unconditional Bernoulli Tree-Based Scan Statistics for R
TreeMineR(
  data,
  tree,
  p = NULL,
  n_exposed = NULL,
  n_unexposed = NULL,
  dictionary = NULL,
  delimiter = "/",
  n_monte_carlo_sim = 9999,
  random_seed = FALSE,
  return_test_dist = FALSE,
  future_control = list(strategy = "sequential")
)
data |
The dataset used for the computation. The dataset needs to include the columns id, leaf, and exposed. See below for the first and last rows included in the example dataset.

  id  leaf  exposed
   1  K251        0
   2  Q702        0
   3   G96        0
   3  S949        0
   4  S951        0
 ---
 999  V539        1
 999  V625        1
 999  G823        1
1000   L42        1
1000  T524        1
|
tree |
A dataset with one variable |
p |
The proportion of exposed individuals in the dataset. Will be calculated
based on |
n_exposed |
Number of exposed individuals (Optional). |
n_unexposed |
Number of unexposed individuals (Optional). |
dictionary |
A |
delimiter |
A character defining the delimiter of different tree levels within your
|
n_monte_carlo_sim |
The number of Monte-Carlo simulations to be used for calculating P-values. |
random_seed |
Random seed used for the Monte-Carlo simulations. |
return_test_dist |
If |
future_control |
A list of arguments passed |
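The data layout and the p argument described above can be sketched in base R as follows. The rows are the ones shown for the example dataset; computing p from distinct individuals (rather than from rows) is an assumption of this sketch, based on p being described as the proportion of exposed individuals.

```r
# Rows taken from the example listing in the data argument.
data <- data.frame(
  id      = c(1, 2, 3, 3, 4, 999, 999, 999, 1000, 1000),
  leaf    = c("K251", "Q702", "G96", "S949", "S951",
              "V539", "V625", "G823", "L42", "T524"),
  exposed = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Keep one row per individual so that people with several
# diagnoses are counted only once.
individuals <- unique(data[, c("id", "exposed")])

n_exposed   <- sum(individuals$exposed == 1)
n_unexposed <- sum(individuals$exposed == 0)

# Proportion of exposed individuals.
p <- n_exposed / (n_exposed + n_unexposed)
```

Individuals without any recorded event do not appear in such a dataset, which is presumably why p, n_exposed, and n_unexposed can also be supplied directly.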
A data.frame with the following columns:
cut
The name of the cut G.
n1
The number of exposed events belonging to cut G.
n0
The number of unexposed events belonging to cut G.
risk1
The absolute risk of getting an event belonging to cut G among the exposed.
risk0
The absolute risk of getting an event belonging to cut G among the unexposed.
RR
The risk ratio, i.e., the absolute risk among the exposed divided by the absolute risk among the unexposed.
llr
The log-likelihood ratio comparing the observed and expected number of exposed events belonging to cut G.
p
The P-value that cut G is a cluster of events.
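As a rough illustration of how the columns above relate to each other, the sketch below computes them for a single cut. It uses the usual form of the unconditional Bernoulli log-likelihood ratio from the tree-based scan statistic literature; that the package uses exactly this form, and that the risks are the event counts divided by n_exposed and n_unexposed, are assumptions of this sketch. The function name cut_stats() is hypothetical.

```r
# Hypothetical helper: statistics for one cut G.
# n1, n0: exposed and unexposed events in the cut
# p:      proportion of exposed individuals
cut_stats <- function(n1, n0, n_exposed, n_unexposed, p) {
  risk1 <- n1 / n_exposed
  risk0 <- n0 / n_unexposed
  n <- n1 + n0

  # x * log(x / y), with the convention 0 * log(0) = 0.
  xlogxy <- function(x, y) if (x == 0) 0 else x * log(x / y)

  # Only an excess of exposed events (n1 / n > p) yields a positive LLR.
  llr <- if (n > 0 && n1 / n > p) {
    xlogxy(n1, n * p) + xlogxy(n0, n * (1 - p))
  } else {
    0
  }

  data.frame(n1 = n1, n0 = n0, risk1 = risk1, risk0 = risk0,
             RR = risk1 / risk0, llr = llr)
}

# Made-up counts for illustration.
res <- cut_stats(n1 = 30, n0 = 70, n_exposed = 100, n_unexposed = 1000,
                 p = 1 / 11)
```

With this definition, a cut in which the share of exposed events does not exceed p receives a log-likelihood ratio of zero and cannot be flagged as a cluster.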
If return_test_dist is TRUE, the function returns a list of two data.frames:
result_table
A data.frame including the results as described above.
test_dist
A data.frame with two columns: iteration, the number of the Monte Carlo iteration, and max_llr, the highest observed log-likelihood ratio for each Monte Carlo simulation. Note that iteration is the calculation based on the original data and is, hence, not included in this data.frame.
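A table shaped like test_dist can be turned into a P-value with the usual Monte Carlo rank formula, sketched below. The simulated max_llr values and the observed log-likelihood ratio are made up, and that TreeMineR() computes its P-values in exactly this way is an assumption of this sketch.

```r
set.seed(1234)
n_monte_carlo_sim <- 99

# Hypothetical stand-in for test_dist: one maximum LLR per simulation.
test_dist <- data.frame(
  iteration = seq_len(n_monte_carlo_sim),
  max_llr   = rexp(n_monte_carlo_sim, rate = 0.5)
)

# Hypothetical LLR of one cut in the original data.
observed_llr <- 17.5

# Rank-based Monte Carlo P-value: the observed data count as one
# additional draw from the null distribution.
n_as_extreme <- sum(test_dist$max_llr >= observed_llr)
p_value <- (1 + n_as_extreme) / (n_monte_carlo_sim + 1)
```

Under this formula the smallest attainable P-value is 1 / (n_monte_carlo_sim + 1), which is why values such as 99 or 9999 give round numbers (0.01 or 0.0001).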
Kulldorff et al. (2003). A tree-based scan statistic for database disease surveillance. Biometrics 59(2): 323-331. DOI: 10.1111/1541-0420.00039.
TreeMineR(data = diagnoses, tree = icd_10_se, p = 1/11, n_monte_carlo_sim = 99, random_seed = 1234) |> head()