Title: | Sum of Ranking Differences Statistical Test |
---|---|
Description: | We provide an implementation for Sum of Ranking Differences (SRD), a novel statistical test introduced by Héberger (2010) <doi:10.1016/j.trac.2009.09.009>. The test allows the comparison of different solutions through a reference by first performing a rank transformation on the input, then calculating and comparing the distances between the solutions and the reference - the latter is measured in the L1 norm. The reference can be an external benchmark (e.g. an established gold standard) or can be aggregated from the data. The calculated distances, called SRD scores, are validated in two ways, see Héberger and Kollár-Hunek (2011) <doi:10.1002/cem.1320>. A randomization test (also called permutation test) compares the SRD scores of the solutions to the SRD scores of randomly generated rankings. The second validation option is cross-validation that checks whether the rankings generated from the solutions come from the same distribution or not. For a detailed analysis about the cross-validation process see Sziklai, Baranyi and Héberger (2021) <doi:10.48550/arXiv.2105.11939>. The package offers a wide array of features related to SRD including the computation of the SRD scores, validation options, input preprocessing and plotting tools. |
Authors: | Jochen Staudacher [aut, cph, cre], Balázs R. Sziklai [aut, cph], Linus Olsson [aut, cph], Dennis Horn [ctb], Alexander Pothmann [ctb], Ali Tugay Sen [ctb], Attila Gere [ctb], Károly Héberger [ctb] |
Maintainer: | Jochen Staudacher <[email protected]> |
License: | GPL-3 |
Version: | 0.1.8 |
Built: | 2024-11-02 03:03:44 UTC |
Source: | https://github.com/cran/rSRD |
R interface to test whether the rankings induced by the columns come from the same distribution. If the number of folds and the test method are not specified, the default is 8-fold cross-validation combined with the Wilcoxon test. If the number of rows is less than 8, leave-one-out cross-validation is applied. Columns are ordered based on the SRD values of the different folds, then each pair of consecutive columns is tested. The test statistic of the Alpaydin test follows an F distribution with df1=2k and df2=k degrees of freedom. The Dietterich test statistic follows a t-distribution with k degrees of freedom (two-tailed). The Wilcoxon test statistic is calculated as the absolute value of the difference between the sum of the positive ranks (W+) and the sum of the negative ranks (W-). The distribution of this test statistic can be derived from the Wilcoxon signed rank distribution. For more information about the cross-validation process see Sziklai, Baranyi and Héberger (2021).
calculateCrossValidation( data_matrix, method = "Wilcoxon", number_of_folds = 8, precision = 5, output_to_file = TRUE )
data_matrix |
A DataFrame. |
method |
A string specifying the method. The methods "Wilcoxon", "Alpaydin" and "Dietterich" are available. |
number_of_folds |
The number of folds used in the cross-validation. Ranges from 5 to 10. |
precision |
The precision used for the ranking matrix transformation. |
output_to_file |
Boolean flag to enable file output. |
A List containing
a new column order sorted by the median of the SRD values computed on the different folds
a vector of test statistics, one for each pair of consecutive columns
a vector indicating the test statistics' statistical significance
the SRD values of different folds and
additional data needed for the plotCrossValidation function.
Balázs R. Sziklai [email protected], Linus Olsson [email protected], Jochen Staudacher [email protected]
Sziklai, Balázs R., Máté Baranyi, and Károly Héberger (2021). "Testing Cross-Validation Variants in Ranking Environments", arXiv preprint arXiv:2105.11939 (2021).
df <- data.frame(
  Sol_1 = c(7, 6, 5, 4, 3, 2, 1),
  Sol_2 = c(1, 2, 3, 4, 5, 7, 6),
  Sol_3 = c(1, 2, 3, 4, 7, 5, 6),
  Ref = c(1, 2, 3, 4, 5, 6, 7))
calculateCrossValidation(df, output_to_file = FALSE)
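The Wilcoxon statistic described for calculateCrossValidation, the absolute difference between the sum of the positive ranks (W+) and the sum of the negative ranks (W-), can be illustrated in base R. This is only a minimal sketch and not the package's implementation: the two vectors of fold-wise SRD values are hypothetical, and discarding zero differences is an assumption.

```r
# Hypothetical unnormalized SRD values of two adjacent columns on 8 folds.
srd_col1 <- c(4, 6, 2, 8, 6, 4, 10, 2)
srd_col2 <- c(6, 2, 8, 4, 12, 6, 6, 10)

d <- srd_col1 - srd_col2
d <- d[d != 0]                 # assumption: zero differences are discarded
r <- rank(abs(d))              # tied absolute differences get average ranks

w_plus  <- sum(r[d > 0])       # sum of the positive ranks (W+)
w_minus <- sum(r[d < 0])       # sum of the negative ranks (W-)
wilcoxon_stat <- abs(w_plus - w_minus)
wilcoxon_stat
```

For these made-up values W+ is 12 and W- is 24, so the statistic is 12.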
R interface to calculate the SRD distribution that corresponds to the data.
calculateSRDDistribution( data_matrix, option = "f", tie_probability = 0, output_to_file = FALSE )
data_matrix |
A DataFrame. |
option |
A char to specify how ties are generated in the simulation. The following options are available:
|
tie_probability |
The probability with which ties can occur. |
output_to_file |
Boolean flag to enable file output. |
A List containing the SRD distribution and related descriptive statistics. The xx1 value indicates the 5 percent significance threshold. SRD values falling between xx1 and xx19 are not distinguishable from the SRD scores of random rankings, while an SRD score higher than xx19 indicates that the solution ranks the objects in reverse order (with 5 percent significance).
Balázs R. Sziklai [email protected], Linus Olsson [email protected]
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
calculateSRDDistribution(df, option = 'p', tie_probability = 0.5)
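The idea behind the SRD distribution, comparing observed scores with the SRD scores of random rankings, can be sketched with a small simulation in base R. The sketch below is an illustration under simplifying assumptions (tie-free rankings, 10000 random permutations, empirical quantiles) and does not claim to reproduce how calculateSRDDistribution() derives the distribution or its xx1 and xx19 values; all variable names are hypothetical.

```r
set.seed(1)                               # make the sketch reproducible
n_rows <- 5                               # number of ranked objects (rows)
reference <- seq_len(n_rows)              # tie-free reference ranking
max_srd <- sum(abs(reference - rev(reference)))

# SRD scores of random permutations measured against the reference
random_srd <- replicate(10000, sum(abs(sample(n_rows) - reference)))
random_srd_norm <- random_srd / max_srd   # scaled to the interval [0, 1]

# Empirical 5% and 95% quantiles of the simulated scores
thresholds <- quantile(random_srd_norm, probs = c(0.05, 0.95))
thresholds
```

In this simplified setting, an observed normalized SRD score below the lower quantile would be closer to the reference than most random rankings are.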
R interface to calculate SRD values. To test the results' significance run calculateSRDDistribution(). For more information about SRD scores and their validation see Héberger and Kollár-Hunek (2011).
calculateSRDValues(data_matrix, output_to_file = FALSE)
data_matrix |
A DataFrame. |
output_to_file |
Boolean flag to enable file output. |
A vector containing the SRD values.
Balázs R. Sziklai [email protected], Linus Olsson [email protected]
Héberger K., Kollár-Hunek K. (2011) "Sum of ranking differences for method discrimination and its validation: comparison of ranks with random numbers", Journal of Chemometrics, 25(4), pp. 151–158.
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
calculateSRDValues(df)
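To make the SRD computation itself concrete, the following base R sketch performs the rank transformation and the L1 distance calculation by hand on the example data. It assumes that the last column (E) serves as the reference, as described for utilsDetailedSRD, and that scores are normalized by the maximum distance between two tie-free rankings; it is an illustration and not the code behind calculateSRDValues().

```r
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

ranks <- sapply(df, rank)               # fractional ranks, column by column
ref <- ranks[, ncol(ranks)]             # assumption: last column is the reference

# L1 distance of every solution's ranking from the reference ranking
srd <- colSums(abs(ranks[, -ncol(ranks)] - ref))

n <- nrow(df)
srd_normalized <- srd / floor(n^2 / 2)  # assumed normalization by the tie-free maximum
srd
```

With these data the unnormalized distances of A, B, C and D from E are 3, 8, 7 and 8.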
Plots data generated by the calculateCrossValidation function as a boxplot. Includes max and min as whiskers as well as the average (marked by a crossed circle), median (marked by a horizontal bold line) and the 1st and 3rd quartile of the values. Visualizes outliers in the data as red triangles.
plotCrossValidation(cv_results)
cv_results |
The List of results returned by the calculateCrossValidation function. |
None.
Linus Olsson [email protected], Alexander Pothmann
df <- data.frame(
  Sol_1 = c(7, 6, 5, 4, 3, 2, 1),
  Sol_2 = c(1, 2, 3, 4, 5, 7, 6),
  Sol_3 = c(1, 2, 3, 4, 7, 5, 6),
  Ref = c(1, 2, 3, 4, 5, 6, 7))
cv_results <- rSRD::calculateCrossValidation(df, output_to_file = FALSE)
rSRD::plotCrossValidation(cv_results)
A heatmap is generated from the pairwise distances, measured in SRD, between the columns. Each column is set as the reference once, then SRD values are calculated for the other columns.
plotHeatmapSRD(df, output_to_file = FALSE, color = utilsColorPalette)
df |
A DataFrame. |
output_to_file |
Logical. If true, the distance matrix will be saved to the hard drive. |
color |
Vector of colors used for the image. Defaults to utilsColorPalette. |
Returns a heatmap and the corresponding distance matrix.
Attila Gere [email protected], Linus Olsson [email protected], Jochen Staudacher [email protected]
srdInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
plotHeatmapSRD(srdInput)
mycolors <- c("#e3f2fd", "#bbdefb", "#90caf9", "#64b5f6", "#42a5f5",
              "#2196f3", "#1e88e5", "#1976d2", "#1565c0", "#0d47a1")
plotHeatmapSRD(srdInput, color = mycolors)
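The pairwise distance matrix underlying such a heatmap can be approximated in base R: rank every column, then take the Manhattan (L1) distance between each pair of ranked columns. This is a sketch under the assumption that distances are left unnormalized; whether plotHeatmapSRD() scales them is not stated above.

```r
srdInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

ranks <- sapply(srdInput, rank)   # fractional ranks, one column per solution

# dist() works on rows, so transpose: each row is now one column's ranking
srd_matrix <- as.matrix(dist(t(ranks), method = "manhattan"))
srd_matrix
```

The result is symmetric with zeros on the diagonal, since each column has distance zero from itself.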
Plots the permutation test for the given data frame by using the simulation data created by the calculateSRDDistribution() function.
plotPermTest(df, simulationData, densityToDistr = FALSE)
df |
A DataFrame. |
simulationData |
The output of the calculateSRDDistribution() function. |
densityToDistr |
Flag to display the cumulative distribution function instead of the probability density. |
None.
Linus Olsson [email protected]
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
simulationData <- rSRD::calculateSRDDistribution(df)
plotPermTest(df, simulationData)
Calculates the distance of two rankings in $L_1$ norm and inserts the result after the first.
utilsCalculateDistance(df, nameCol, refCol)
df |
A DataFrame. |
nameCol |
The current column of the iteration. |
refCol |
The reference column of the DataFrame. |
Returns a new df that has a distance column based on the nameCol.
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
nameCol <- "A"
refCol <- "B"
rSRD::utilsCalculateDistance(SRDInput, nameCol, refCol)
Calculates the ranking of a given column.
utilsCalculateRank(df, nameCol)
df |
A DataFrame. |
nameCol |
The name of the column to be ranked. Note that this parameter needs to be specified as there is no default value. |
Returns a new df that has an additional column with the rankings of the column specified by nameCol.
Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
columnName <- "A"
rSRD::utilsCalculateRank(SRDInput, columnName)
Unique color palette for heatmaps.
utilsColorPalette
An object of class character of length 250.
Attila Gere [email protected], Balázs R. Sziklai [email protected]
barplot(rep(1,250), col = utilsColorPalette)
Adds a new reference column based on the input DataFrame df and the given method. This function iterates over the rows and applies the given method to define the value of the reference. Available options are: max, min, median, mean and mixed. This column is appended to the DataFrame. When "mixed" is specified the function will consider the refVector for creating the reference column.
utilsCreateReference(df, method = "max", refVector = c())
df |
A DataFrame. |
method |
A string value specifying the reference creating method. Available options: max, min, median, mean and mixed. |
refVector |
A vector of strings that specifies a method for each row. Vector size should be equal to the number of rows in the DataFrame df. |
Returns a new DataFrame appended with the reference column created by the method.
Ali Tugay Sen, Linus Olsson [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
proc_data <- rSRD::utilsPreprocessDF(SRDInput)
ref <- c("min", "max", "min", "max", "mean")
rSRD::utilsCreateReference(proc_data, method = "mixed", ref)
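The row-wise aggregation behind a "mixed" reference can be pictured with a short base R sketch: each row is reduced with the method named for it. This is an illustration only. It works on the raw (unpreprocessed) values, the column name Ref is a hypothetical choice, and utilsCreateReference() may differ in its details.

```r
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

# One aggregation method per row, as in the refVector argument
row_methods <- c("min", "max", "min", "max", "mean")

reference <- vapply(seq_len(nrow(df)), function(i) {
  aggregate_fun <- match.fun(row_methods[i])   # look up min, max or mean by name
  aggregate_fun(unlist(df[i, ]))               # apply it to the values of row i
}, numeric(1))

df_with_ref <- cbind(df, Ref = reference)      # "Ref" is a hypothetical column name
df_with_ref
```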
Detailed calculation of the SRD values including the computation of the ranking transformation. Unless a column is specified with referenceCol, the last column will always be taken as the reference.
utilsDetailedSRD( df, referenceCol, createRefCol = function() { } )
df |
A DataFrame. |
referenceCol |
Optional. A string that contains a column of |
createRefCol |
Optional. Can be max, min, median, mean. Creates a new Column based on the existing |
Returns a new DataFrame that shows the detailed SRD computation (ranking transformation and distance calculation). A newly added row contains the SRD values (displayed without normalization).
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
rSRD::utilsDetailedSRD(SRDInput)
Detailed calculation of the SRD values including the computation of the ranking transformation. Unless a column is specified with referenceCol, the last column will always be taken as the reference. This variant differs from utilsDetailedSRD in that non-numeric columns will not be converted to chars, i.e. the data types of non-numeric columns will be preserved in the output.
utilsDetailedSRDNoChars( df, referenceCol, createRefCol = function() { } )
df |
A DataFrame. |
referenceCol |
Optional. A string that contains a column of |
createRefCol |
Optional. Can be max, min, median, mean. Creates a new Column based on the existing |
Returns a new DataFrame that shows the detailed SRD computation (ranking transformation and distance calculation). A newly added row contains the SRD values (displayed without normalization).
Ali Tugay Sen, Jochen Staudacher [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
rSRD::utilsDetailedSRDNoChars(SRDInput)
Calculates the maximum distance between two rankings of size n. This function is used to normalize SRD values.
utilsMaxSRD(rowsCount)
rowsCount |
The number of rows in the SRD calculation. |
The maximum achievable SRD value.
Dennis Horn
maxSRD <- rSRD::utilsMaxSRD(5)
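For tie-free rankings, the largest L1 distance between two rankings of size n is attained by a ranking and its reverse, which equals floor(n^2 / 2). The base R sketch below computes it that way; the function name max_srd_sketch is hypothetical, and whether utilsMaxSRD() uses the same formula internally is an assumption.

```r
# Maximum L1 distance between two tie-free rankings of size n:
# the distance between the ranking 1..n and its reverse n..1.
max_srd_sketch <- function(n) {
  forward <- seq_len(n)
  sum(abs(forward - rev(forward)))
}

max_srd_sketch(5)   # 4 + 2 + 0 + 2 + 4 = 12
```

Dividing a raw SRD value by this maximum rescales it to the interval [0, 1], which makes scores comparable across inputs with different numbers of rows.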
This function preprocesses the DataFrame depending on the method.
utilsPreprocessDF(df, method = "range_scale")
df |
A DataFrame. |
method |
A string that should contain "scale_to_unit", "standardize", "range_scale" or "scale_to_max". |
Returns a new df that has been preprocessed according to the method.
Ali Tugay Sen, Dennis Horn [email protected], Linus Olsson [email protected]
SRDInput <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
method <- "standardize"
utilsPreprocessDF(SRDInput, method)
R interface to perform the rank transformation on the columns of the input data frame. Ties are resolved by fractional ranking.
utilsRankingMatrix(data_matrix)
data_matrix |
A DataFrame. |
A DataFrame containing the ranking matrix.
Balázs R. Sziklai [email protected], Linus Olsson [email protected]
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))
utilsRankingMatrix(df)
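Fractional ranking assigns tied values the average of the ranks they would otherwise occupy. Base R's rank() does this with ties.method = "average", so a ranking matrix can be sketched as follows. The sketch assumes ascending ranks (the smallest value gets rank 1); the direction used by utilsRankingMatrix() is not stated above.

```r
df <- data.frame(
  A = c(32, 52, 44, 44, 47),
  B = c(73, 75, 65, 76, 70),
  C = c(60, 59, 57, 55, 60),
  D = c(35, 24, 44, 83, 47),
  E = c(41, 52, 46, 50, 65))

# Rank every column separately; tied values share the mean of their ranks
ranking_matrix <- as.data.frame(lapply(df, rank, ties.method = "average"))
ranking_matrix
```

In column A the two values of 44 would occupy ranks 2 and 3, so both receive rank 2.5.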
Calculates the tie probability for a given vector. The tie probability is defined as the number of consecutive tied component-pairs in the sorted vector divided by the size of the vector minus 1.
utilsTieProbability(x)
x |
A vector. |
Returns the tie probability as a numeric value.
Ali Tugay Sen, Linus Olsson [email protected]
x <- c(1, 2, 4, 4, 5, 5, 6)
rSRD::utilsTieProbability(x)
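The definition of the tie probability given above translates directly into base R: sort the vector, count the consecutive pairs that are equal, and divide by the length minus one. The function name tie_probability_sketch is hypothetical, and the sketch mirrors the stated definition rather than the package source.

```r
tie_probability_sketch <- function(x) {
  sorted <- sort(x)
  tied_pairs <- sum(diff(sorted) == 0)   # consecutive equal pairs in the sorted vector
  tied_pairs / (length(sorted) - 1)
}

x <- c(1, 2, 4, 4, 5, 5, 6)
tie_probability_sketch(x)                # 2 tied pairs out of 6
```

Here the pairs (4, 4) and (5, 5) are tied, giving 2 / 6.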