Computational intelligence methods
and data understanding


Włodzisław Duch


Computational Intelligence Laboratory,
Department of Informatics,
Nicolaus Copernicus University,

Grudziądzka 5, 87-100 Toruń, Poland.
e-mail: wduch, at the server fizyka.umk.pl.

WWW: http://www.fizyka.umk.pl/~duch



With thanks to Yoichi Hayashi, Meiji University, Tokyo,

and members of my group:
Norbert Jankowski, Antoine Naud, Rafał Adamczak, Krzysztof Grąbczewski and Karol Grudziński


Plan

  1. Understanding the data and logical explanations
  2. Rules and probabilities
  3. Visualization-based explanation
  4. Examples of real-life applications
  5. Conclusions


Understanding the data and logical explanations

Goal of Computational Intelligence: not only prediction, diagnosis and classification, but also understanding of the data, explanation, reliability assessment, knowledge discovery, building theories and causal models.

What type of explanation is satisfactory?
An interesting problem for cognitive science (CS).
Exemplar and prototype theories of categorization in CS.

Types of explanation:

"Best explanation" - depends on particular field.

Logical rules, if simple enough, are preferred by humans.

Rules require symbols, defined by predicate functions Property(X) ∈ [0,1].
Perception:

Types of logical rules:

Crisp logic rules:
for continuous X use linguistic variables, Property(X) = 0 or 1 (F/T):

IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie)
ELSE IF ... ELSE ...

Crisp logic based on rectangular membership functions: True/False values jump from 0 to 1.
Step functions are used for partitioning of the input space.

Decision regions: hyperrectangular (cuboidal).
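
A minimal sketch (Python, illustrative only) of how crisp rules built from step functions carve out hyperrectangular decision regions; the thresholds and the Brownie predicates are hypothetical:

# Crisp logic: each predicate is a 0/1 step function of one feature,
# so the region where the rule fires is a hyperrectangle (cuboid).

def step(x, threshold):
    # rectangular membership: jumps from 0 to 1 at the threshold
    return 1 if x >= threshold else 0

def is_brownie(height_cm, has_hat, has_beard):
    small_height = 1 - step(height_cm, 100.0)   # hypothetical threshold
    # conjunction of crisp predicates = product of 0/1 values
    return bool(small_height * int(has_hat) * int(has_beard))

print(is_brownie(80.0, True, True))    # True  -> "X is a Brownie"
print(is_brownie(180.0, True, True))   # False -> ELSE branch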

The range of problems that can be analyzed using crisp logic is limited.

Crisp logic rules are most desirable; try them first,
but remember ...




Fuzzy logic rules

Property(X) ∈ [0,1] instead of 0 or 1.
Typical approach: define triangular, trapezoidal, Gaussian or other types of predicate (membership) functions.
Membership function μ(X) - the degree of truth that X has the property μ.

Frequently features are partitioned by membership functions, defining large receptive fields, reducing information (?).
For example, using triangular functions.

Fuzzy logic: separable membership functions - products of one-dimensional factors, μ(X) = μ_1(X_1)·μ_2(X_2)·...·μ_N(X_N).

Many other possibilities exist to produce N-dimensional membership functions.
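
A small sketch of a triangular membership function and of a separable N-dimensional membership obtained as a product of one-dimensional factors (the breakpoints below are hypothetical):

def triangular(x, a, b, c):
    # triangular membership: 0 outside [a, c], 1 at the peak b
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def separable_membership(xs, params):
    # product of one-dimensional factors mu_i(x_i)
    mu = 1.0
    for x, (a, b, c) in zip(xs, params):
        mu *= triangular(x, a, b, c)
    return mu

# degree of truth that a 2-dimensional vector belongs to a fuzzy region
print(separable_membership([1.2, 0.4], [(0.0, 1.0, 2.0), (0.0, 0.5, 1.0)]))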
Fuzzy rules have continuous membership functions, giving some advantages:

But remember ...

Example: Ljubljana breast cancer data

286 cases: 201 no-recurrence events (70.3%) and 85 recurrence events (29.7%).
9 features, symbolic with 2 to 13 values.

Single logical rule:

IF involved-nodes > 2 AND degree-malignant = 3 THEN recurrence
ELSE no recurrence

gives over 77% accuracy in cross-validation tests;
best classification systems do not exceed 78% accuracy (insignificant difference).
Some fuzzy/rough systems provided 200 rules with CV accuracy close to the base rate!
Reason: too complex, hard to find simplest solution.

All knowledge contained in the data is:
IF more than 2 areas were infected
AND cancer is highly malignant
THEN there will be recurrence.
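
The same rule written directly as a function (a transcription of the rule above; the numeric encoding of the symbolic features is assumed):

def recurrence_rule(involved_nodes, degree_malignant):
    # single crisp rule for the Ljubljana breast cancer data
    if involved_nodes > 2 and degree_malignant == 3:
        return "recurrence"
    return "no recurrence"

print(recurrence_rule(involved_nodes=4, degree_malignant=3))   # recurrence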

Clusters may have complex decision border shapes.

IF X ∈ C THEN Fact_k = TRUE

Simple rules may require non-linear feature transformations.
Neural networks, kernel methods (like SVM): nonlinear mapping from the feature space to the image space aiming at linear separability.



Problems with rule-based classification models:




Rules and probabilities

Simplest things first: try crisp rules, large information reduction.

Data from measurements/observations are not precise.
Finite resolution, small receptive fields: Gaussian error distribution:

x => G_x = G(y; x, s_x), where G_x is a Gaussian (fuzzy) number.

Given a set of logical rules {R}, apply them to input data {G_x}.
Use Monte Carlo sampling to recover p(C_i | G_x; {R}) - may be used with any classifier.
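
A minimal Monte Carlo sketch, assuming a generic classify(x) function (any classifier or rule set) and a diagonal Gaussian uncertainty s around the measured vector x; all names and values are placeholders:

import random
from collections import Counter

def p_class_given_Gx(classify, x, s, n_samples=10000, seed=0):
    # estimate p(C_i | G_x) by sampling the Gaussian (fuzzy) input G_x
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        sample = [rng.gauss(xi, si) for xi, si in zip(x, s)]
        counts[classify(sample)] += 1
    return {c: n / n_samples for c, n in counts.items()}

# a toy crisp rule playing the role of the rule set {R}
rule = lambda v: "C1" if v[0] > 2.0 and v[1] > 0.5 else "C2"
print(p_class_given_Gx(rule, x=[2.1, 0.6], s=[0.3, 0.2]))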

The rule R_a(x) = True iff x > a, applied to a Gaussian input G_x, is true with probability

p(R_a(G_x) = T) = ∫_a^∞ G(y; x, s_x) dy ≈ σ(β(x - a)),

because the cumulative distribution (cumulant) function is approximated by the logistic function

σ(y) = 1/(1 + exp(-βy))

to within 2% for a slope β proportional to 1/s_x (β ≈ 1.7/s_x).

Soft trapezoidal membership functions (MF) realized by neural L-units are obtained.

Fuzzy logic with soft trapezoidal MFs and crisp inputs is equivalent to crisp logic with G_x inputs;
MLP neural networks with logistic transfer functions provide such MFs.
The MLP model thus includes fuzzy logic with soft trapezoidal membership functions as a special case.
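
A sketch of a soft trapezoidal membership function; the difference-of-two-logistics form used here is an assumption about how such an L-unit window can be realized:

import math

def logistic(y, beta=1.0):
    return 1.0 / (1.0 + math.exp(-beta * y))

def soft_trapezoid(x, a, b, beta):
    # soft trapezoidal window over [a, b]; large beta -> crisp rectangle
    return logistic(x - a, beta) - logistic(x - b, beta)

# beta controls how "crisp" the linguistic variable becomes
for beta in (1.0, 5.0, 50.0):
    print(beta, round(soft_trapezoid(1.5, a=1.0, b=2.0, beta=beta), 3))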

If rules overlap and their conditions are correlated, the formula reproducing the Monte Carlo results is the inclusion-exclusion sum

p(C | X; M) = Σ_{R ∈ 2^{R_C}} (-1)^{|R|+1} p(X satisfies all rules in R),

where 2^{R_C} are all subsets of the set of classification rules for class C and |R| is the number of rules in the subset.
This formula simply avoids counting the same regions twice.
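
A sketch of the inclusion-exclusion sum, assuming a helper p_and(rules, x) that returns the probability that x satisfies all conditions of the given rules simultaneously (e.g. estimated by the Monte Carlo procedure above):

from itertools import combinations

def p_class(rules_for_C, p_and, x):
    # sum over all non-empty subsets of the rules for class C,
    # with alternating signs to avoid counting overlapping regions twice
    total = 0.0
    for k in range(1, len(rules_for_C) + 1):
        for subset in combinations(rules_for_C, k):
            total += (-1) ** (k + 1) * p_and(subset, x)
    return total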

This is not a fuzzy approach!
Here small receptive fields are used; in the fuzzy approach a few large receptive fields define the linguistic variables.

Benefits:

  1. Probabilities instead of 0/1 crisp rule decisions.
  2. Vectors that were not classified by the crisp rules now have non-zero probabilities.
  3. Dispersions Sx may be treated as adaptive parameters of the model M.
  4. Gradient methods may be used for large-scale optimization.

Next step: optimization of rules exploring the confidence-rejection rate tradeoff.

Define confusion matrix F(Ci,Cj|M) counting the number of cases from class Cj assigned by the set of rules M to the class Ci.

For 2 class problems:

F(Ci,Cj|M) =
  [ F++  F+- ]
  [ F-+  F-- ]

where Fij = Nij / N is the frequency of cases.
F(Ci,Cj |R) may be defined for a rule or a (sub)set of rules.

Ideally only the diagonal elements of the confusion matrix, summing to N, should remain.

Define a weighted combination of the number of errors and the "predictive power" of the rules:

E(M; γ) = γ Σ_{i≠j} F(Ci,Cj|M) - Σ_i F(Ci,Ci|M)

This should be minimized without constraints; counting cases, it is bounded from below by -N (the number of all training vectors).

Sets of rules M are parameterized by the interval boundaries Xk, X'k.
For γ = 0 the predictive power of the rules is maximized.
Rules that make fewer errors on the training data should be more reliable.
The cost function E(M; γ) allows the number of errors to be reduced to zero (large γ) for rules M that reject some instances.
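
A sketch of this cost function, with the confusion matrix given as raw counts N_ij so that the minimum is -N; the matrix values are hypothetical:

def rule_cost(confusion, gamma):
    # E(M; gamma) = gamma * (errors) - (correctly classified cases)
    n = len(confusion)
    errors = sum(confusion[i][j] for i in range(n) for j in range(n) if i != j)
    correct = sum(confusion[i][i] for i in range(n))
    return gamma * errors - correct

M = [[180, 21], [25, 60]]          # [[N++, N+-], [N-+, N--]]
print(rule_cost(M, gamma=0.0))     # -240: only predictive power counts
print(rule_cost(M, gamma=5.0))     # errors heavily penalized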




Visualization-based explanation

How to understand the data?

Exploratory data analysis - show the data.
Kohonen's topographic SOM (self-organizing map) - most popular, trying to classify and display at the same time, but poor in both respects.

Assessing new data:

Investigate the dependence of p(C|G_X; M,S) on S.
If new classes appear with significant probability, analyze the dependence of p(C|G_X; M,S) on the uncertainties S_i of individual features X_i.

Example:
with S_i = 0, "organic problems" diagnosed with 100% confidence;
with 1% uncertainty (optimal, lowest error), "organic problems" with 90% confidence.
Probabilities of different classes are shown as a function of the assumed uncertainties.
S_i = r·[Max_i - Min_i]; the horizontal axis shows r as a percentage of the input feature range, the vertical axis shows probability in percent.

In ambiguous cases the probabilities of several new classes rise quickly with growing r, while that of the most probable class quickly decreases.
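
A sketch of how such curves can be generated: for each relative uncertainty r, set S_i = r·(Max_i - Min_i) and re-estimate the class probabilities (the estimator argument stands for a procedure like the Monte Carlo sampler sketched earlier):

def probability_vs_uncertainty(classify, x, feature_ranges, r_values, estimator):
    # p(C | G_x; M, S) as a function of the assumed relative uncertainty r
    curves = {}
    for r in r_values:
        s = [r * (hi - lo) for lo, hi in feature_ranges]
        curves[r] = estimator(classify, x, s)
    return curves

# usage (with the Monte Carlo estimator from the previous section):
# curves = probability_vs_uncertainty(rule, x, ranges, [0.0, 0.01, 0.05], p_class_given_Gx)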

PCI, Probabilistic Confidence Intervals

If the probability of new classes grows quickly with the assumed measurement uncertainty, analyze the probabilistic confidence levels.
Assume that the value of a single feature is changed while all others are kept constant; display the two largest probabilities.

A mixed class appears: paranoid schizophrenia.

Sensitivity analysis: which feature has the strongest influence on the decision?
Shows probabilities in the neighborhood of the analyzed case for a single feature.
Allows one to evaluate the reliability of the classification, but not to explain the data.
May be used with any classifier.
May illustrate the trajectory of changes; see the sketch below.
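
A sketch of that single-feature scan: one feature is varied over a range of values while all others stay fixed, and for each value the two largest class probabilities are recorded (predict_proba stands for any classifier returning a dictionary of class probabilities):

def feature_scan(predict_proba, x, feature_index, values):
    # vary one feature, keep the others constant, track the two top classes
    results = []
    for v in values:
        probe = list(x)
        probe[feature_index] = v
        probs = predict_proba(probe)
        top_two = sorted(probs.items(), key=lambda kv: -kv[1])[:2]
        results.append((v, top_two))
    return results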

IMDS, Interactive MultiDimensional Scaling

Interactively shows the neighborhood of the analyzed case, trying to preserve all topographic relations.
Minimization of a stress function:

S(d) = Σ_{i<j} w_ij (D_ij - d_ij)²

where D_ij are distances in the original feature space, d_ij are distances in the display space, and w_ij are weights allowing control over which distances are to be preserved more accurately.
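
A minimal numpy sketch of minimizing this weighted stress by gradient descent (random initialization, fixed step size; a real interactive implementation would be more careful):

import numpy as np

def weighted_mds(D, w, dim=2, steps=2000, lr=0.01, seed=0):
    # D, w: n x n arrays of target distances and weights
    # minimize S = sum_{i<j} w_ij (D_ij - d_ij)^2 for a 2-dimensional display
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(size=(n, dim))
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]        # pairwise differences
        d = np.sqrt((diff ** 2).sum(-1)) + 1e-9     # display-space distances
        g = w * (d - D) / d                         # per-pair gradient factor
        grad = 2 * (g[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

# Y = weighted_mds(D, w) maps n points to 2D; larger w_ij preserve D_ij better.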

Example: focusing on a data point 'p554' from class 'organika'
- Purpose: view (understand) why this data is classified into class 'organika'.
Classified using the IncNet neural network, for which features 2, 4 and 7 are sufficient to classify the 'organika' class correctly.
Progressive zooming: the 200, 100, 50 and 20 data points nearest to 'p554' (interactively selected) are mapped successively; 'p554' is marked by a black circled dot.

200 nearest neighbors

50 nearest neighbors




Examples of real-life applications

An example of a system based on rules derived from data.

Analysis of psychometric questionnaires

An example of data where logical rules do not work.

Indian Telugu vowel sounds

871 patterns, 6 overlapping vowels, 3 features (formant frequencies).
No reason why simple rules should exist in this case.

Method                           Test accuracy (%)   Reference
10xCV tests:
  3-NN, Manhattan                88.1 ± 0.4          Kosice
  FSM, 65 Gaussian nodes         87.4 ± 0.5          Kosice
  SSV dec. tree, 22 rules        86.0                Kosice
2xCV tests:
  3-NN, Euclidean                86.1 ± 0.6          Kosice
  FSM, 40 Gaussian nodes         85.2 ± 1.2          Kosice
  MLP                            84.6                Pal
  Fuzzy MLP                      84.2                Pal
  SSV dec. tree, beam search     83.3 ± 0.9          Kosice
  Bayes classifier               79.2                Pal
  Fuzzy SOM                      73.5                Pal

More than 20 rules give lower accuracy than kNN.
Simplest explanation: 6 FSM prototypes (one per class), Gaussian weighting of the Euclidean distance.
10xCV gives 72.4% accuracy - insufficient info in the 3 formants.
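
A sketch of classification with one prototype per class and Gaussian weighting of the Euclidean distance; the prototype vectors and the dispersion sigma below are purely illustrative (in FSM they are adapted to the data):

import math

def classify_by_prototypes(x, prototypes, sigma=1.0):
    # prototypes: {class_label: prototype_vector};
    # pick the class with the largest Gaussian-weighted distance activation
    def activation(p):
        d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return max(prototypes, key=lambda c: activation(prototypes[c]))

# with the Telugu vowels, 6 prototypes (one per class) in the 3-formant space would be used
protos = {"a": [700.0, 1200.0, 2500.0], "i": [300.0, 2300.0, 3000.0]}
print(classify_by_prototypes([650.0, 1300.0, 2600.0], protos, sigma=300.0))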




Conclusions

Understanding the data is possible through:

Rules: frequently the rule sets and/or the rule-based classification are unstable.
A small number of accurate rules - the best explanation?

Interpretation/evaluation of new cases:

Class probabilities may always be estimated by Monte Carlo.
For crisp and most fuzzy rule sets analytical formulas exist.


References

Most papers are available from these pages
http://www.fizyka.umk.pl/kmk/publications.html
http://www.is.umk.pl/~duch/cv/papall.html

