This section provides a summary of current OpenTox Feature Selection components.
Feature Selection Components
CFS (Correlation-based Feature Selection) is a correlation-based filter method. It gives high scores to subsets
whose features are highly correlated with the class attribute but have low correlation with each other. Let S be
an attribute subset that has k attributes, let rcf be the average correlation of the attributes with the class
attribute, and let rff be the average intercorrelation between attributes. The merit of S is then
k * rcf / sqrt(k + k * (k - 1) * rff).
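The scoring idea can be illustrated with a minimal sketch. It assumes numeric attributes and uses the absolute Pearson correlation as the correlation measure; both choices, and the names pearson and cfs_merit, are illustrative assumptions and not the OpenTox implementation. The merit computed is k * rcf / sqrt(k + k * (k - 1) * rff), with rcf and rff taken as averages.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Absolute Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return abs(cov / math.sqrt(vx * vy))


def cfs_merit(features, target):
    """Merit of a subset S: k * rcf / sqrt(k + k * (k - 1) * rff)."""
    k = len(features)
    # rcf: average correlation between each attribute and the class attribute
    rcf = sum(pearson(f, target) for f in features) / k
    # rff: average pairwise intercorrelation between the attributes
    pairs = list(combinations(features, 2))
    rff = sum(pearson(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * rcf / math.sqrt(k + k * (k - 1) * rff)


# Hypothetical toy data: a and b are uncorrelated, and the target is their sum.
a = [1.0, 1.0, -1.0, -1.0]
b = [1.0, -1.0, 1.0, -1.0]
target = [2.0, 0.0, 0.0, -2.0]

merit_single = cfs_merit([a], target)
merit_complementary = cfs_merit([a, b], target)
merit_redundant = cfs_merit([a, a], target)
```

In this toy case, adding the complementary attribute b raises the merit, while adding a duplicate of a leaves it unchanged, which is the behaviour the description asks for.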
Feature selection via the chi-squared (χ²) test is a commonly used method. The χ² method
evaluates features individually by measuring their chi-squared statistic with respect to the classes.
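As an illustration, the statistic for a single discrete feature can be computed from its contingency table against the class. The sketch below assumes discrete feature values; the function name chi_squared and the toy data are hypothetical.

```python
from collections import Counter


def chi_squared(feature, labels):
    """Chi-squared statistic of a discrete feature with respect to the classes."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # contingency table cells
    feature_totals = Counter(feature)         # row totals
    class_totals = Counter(labels)            # column totals
    stat = 0.0
    for value, value_count in feature_totals.items():
        for cls, class_count in class_totals.items():
            # Count expected in this cell if feature and class were independent
            expected = value_count * class_count / n
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: one feature follows the class, the other does not.
labels = ["active", "active", "inactive", "inactive"]
features = {
    "informative": [1, 1, 0, 0],
    "noise": [1, 0, 1, 0],
}
scores = {name: chi_squared(col, labels) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Features are then ranked by their individual scores, and the top-ranked ones are kept.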
The FCBF (Fast Correlation-Based Filter) algorithm consists of two stages: the first one is a relevance analysis,
aimed at ordering the input variables depending on a relevance score, which is computed as the symmetric
uncertainty with respect to the target output. This stage is also used to discard irrelevant variables, which are
those whose ranking score is below a predefined threshold. The second stage is a redundancy analysis, aimed
at selecting predominant features from the relevant set obtained in the first stage. This selection is an iterative
process that removes each variable for which a more relevant, already selected variable forms an approximate
Markov blanket.
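The two stages can be sketched as follows for discrete variables. This is a simplified illustration under stated assumptions, not the OpenTox implementation: the names entropy, symmetric_uncertainty and fcbf, the threshold value and the toy data are hypothetical, and a variable is treated as redundant when its symmetric uncertainty with an already selected variable is at least as large as its own relevance score.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    hxy = entropy(list(zip(x, y)))
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def fcbf(features, labels, threshold):
    """features maps a name to a list of discrete values; returns selected names."""
    # Stage 1: relevance analysis. Score every variable against the target,
    # discard those below the threshold and order the rest by relevance.
    relevance = {n: symmetric_uncertainty(col, labels) for n, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] >= threshold),
                    key=lambda n: relevance[n], reverse=True)
    # Stage 2: redundancy analysis. Keep the most relevant remaining variable,
    # then drop every variable for which it forms an approximate Markov blanket.
    selected = []
    while ranked:
        best = ranked.pop(0)
        selected.append(best)
        ranked = [n for n in ranked
                  if symmetric_uncertainty(features[best], features[n]) < relevance[n]]
    return selected


# Hypothetical toy data with a duplicated variable and an irrelevant one.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 0, 1, 0, 1, 1, 0, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],
}
selected = fcbf(features, labels, threshold=0.1)
```

Here the irrelevant variable is removed in the first stage and the duplicate in the second, leaving the two variables that carry distinct information about the target.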
Information Gain Attribute Evaluation evaluates the worth of an attribute by measuring the information gain with respect to the
class.
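A minimal sketch of the measure, assuming discrete attribute values: the information gain is the entropy of the class minus the entropy of the class that remains once the attribute value is known. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(attribute, labels):
    """H(class) - H(class | attribute) for a discrete attribute."""
    groups = defaultdict(list)
    for value, cls in zip(attribute, labels):
        groups[value].append(cls)  # class labels seen for each attribute value
    n = len(labels)
    # Conditional entropy: class entropy within each group, weighted by group size
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data
labels = ["yes", "yes", "no", "no"]
gain_perfect = information_gain(["a", "a", "b", "b"], labels)
gain_useless = information_gain(["a", "b", "a", "b"], labels)
```

An attribute that determines the class completely obtains the full class entropy as its gain, while an attribute that is independent of the class obtains zero.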
Principal Component Analysis (PCA) is mathematically defined as an orthogonal linear transformation that
transforms the data to a new coordinate system such that the greatest variance by any projection of the data
comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so forth. These
coordinates are called the principal components.
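To make the definition concrete, the sketch below finds only the first principal component, using power iteration on the sample covariance matrix. Power iteration is a choice made here for brevity and is not necessarily what a production PCA uses; the function name, the start vector and the toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by the covariance matrix pulls
    # the vector towards the direction of greatest variance. The start vector
    # is assumed not to be orthogonal to that direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # constant data: there is no variance to capture
        v = [x / norm for x in w]
    # Variance of the data projected onto v
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance


# Hypothetical toy data lying exactly on the line y = x / 2
rows = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0)]
direction, variance = first_principal_component(rows)
```

Because the toy points lie on a single line, the first component points along that line and accounts for all of the variance; further components would be obtained by repeating the procedure on the data with this direction projected out.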
The wrapper approach depends on the classifier that will be used with the resulting attribute subset.
Wrapper methods evaluate subsets by running the classifier on the training data, using only the attributes of
the subset. The better the classifier performs, usually as estimated by cross-validation, the better the selected
attribute set. Classification accuracy is normally used as the score for the subset.
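A wrapper can be sketched as a search loop around a classifier. The example below is one possible instance under stated assumptions: a 1-nearest-neighbour classifier, leave-one-out cross-validation, and greedy forward selection as the search strategy. None of these choices come from the OpenTox components, and the function names and toy data are hypothetical.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the subset."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # the held-out instance must not vote for itself
            # Squared Euclidean distance using only the attributes in the subset
            dist = sum((row[f] - other[f]) ** 2 for f in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels):
    """Greedily add the attribute that most improves cross-validated accuracy."""
    remaining = list(range(len(rows[0])))
    selected, best_score = [], 0.0
    while remaining:
        candidate = max(remaining,
                        key=lambda f: loo_accuracy(rows, labels, selected + [f]))
        score = loo_accuracy(rows, labels, selected + [candidate])
        if score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score


# Hypothetical toy data: attribute 0 separates the classes, attribute 1 does not.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 1.1), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]
selected, best_score = forward_selection(rows, labels)
```

Every candidate subset requires a full cross-validation run, which is why wrapper methods are usually much more expensive than the filter methods described earlier in this section.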