Paper: "The Role of Occam's Razor in Knowledge Discovery"

. Friday, April 21, 2006

Reference: “The Role of Occam's Razor in Knowledge Discovery.” Data Mining and Knowledge Discovery, 3, 409-425, 1999.


I found this paper, by Pedro Domingos, very interesting. It analyzes the role developed by the Occam’s Razor in KDD and studies (referencing a lot of papers where Occam’s Razor is explicitly or implicitly applied) the correctness of applying it.

Pedro Domingo says the Occam’s Razor in KDD can be seen as

  1. “Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself” or...
  2. ...“Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.”

The second version of the Occam’s Razor is clearly mistaken, as a low training-set error usually derives into overfitting, and a high generalization error. Pedro Domingos gives some arguments to favor and against this second one, concluding that “the second version is provably and empirically false”. The first version seems right but uses simplicity as a proxy to comprehensibility, resulting in some quite different from Occam’s Razor.

Ensemble Learning (A very short introduction)

. Thursday, April 20, 2006

Ensemble methods are methods than rather than finding one best hypothesis for the learning problem construct a set of hypotheses (called "committee" or "ensemble") that can, in some fashion, "vote" to predict the class of the new data example. As defined in [1], "an ensemble method constructs a set of hypothesis {h1, . . . , hk}, chooses a set of weights {w1, . . . , wk} and constructs the "voted" classifier

The classification decision of the combined classifier is +1 if H(x) >= 0 y -1 otherwise"

Ensemble learning algorithms work running a base learning algorithm multiple times, generating a hypothesis at each iteration. The hypothesis can be generated in two ways

1) In a independent way, resulting in a set of diverse hypothesis. It can be done by running the algorithm several times and providing different training data at each iteration or providing different subsets of the input features at each iteration. An example of this way is Bagging [2].
2) In an additive way, adding one hypothesis at a time to the ensemble, and constructing the hypothesis trying to minimize the classification error on a weighted training data set. An example of this way is Boosting [3], [4].

Bagging (“Bootstrap Aggregatting”)

Given a training data with m examples, Bagging works by generating at each iteration a new training data set of m examples by sampling uniformly from the original examples, which means some original examples appear multiple times and other original examples do not appear in the resampling. If the learning algorithm is unstable (small changes in the training data causes large changes in the hypothesis, like decision trees) then Bagging will produce a set of diverse hypothesis (H).

Having H, for classifying a new example, we should proceed to a voting between all the hypothesis hi belonging to H, and the final classification would be the most voted one.


Boosting assings a weight to each training sample and then, at each iteration, generates a model that tries to minimize the sum of the weights of the misclassificated examples. The errors at each iteration are used to actualize the weights of the training samples, increasing the misclasificated instances’ weights and decreasing the correctly classificated instances’ weights.

¿Why use Ensembles?

[5] gives a good justification of ensemble using.

  1. "The first cause of the nees for ensembles is that the trainning data may not provide sufficient information for choosing a single best classifier from H. Most of our learning algorithms consider very large hypothesis spaces, so even after eliminating hypotheses that misclassify training examples, there are many hypothesis remaining".
  2. "A second that our learning algorithms may not be able to solve the difficult search problems that we pose"
  3. "A third that our hypothesis space H may not contain the trus function f"

Basic References
[1] Dietterich, T. G. (2002). "Ensemble Learning." In The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. 405-408. (Download)

Specific References
[2] Breiman, L. (1996a), "Bagging Predictors", Machine Learning, Vol. 24, No. 2, pp. 123-140. (Download)
[3] Yoav Freund and Robert E. Schapire. "Experiments with a new boosting algorithm." In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996. (Download)
[4] J. R. Quinlan, "Bagging, Boosting, and C4.5 (1996)", AAAI/IAAI, Vol. 1 (Download)

Other References:
[5] Dietterich, T. G, "Machine Learning: Four Current Directions" (1997) (Download)

Ensemble Learning Researchers:

A reading list on Bayesian methods


Tom Griffiths mantains a list of papers related to Bayesian methods. It seems a good place for starting Bayesian researchers:

Bayesian Model Averaging (A very simple description)


When trying to select a model that explains the data, we usually select (from all the possible models) the one which better fits the data. But sometimes we have some model that explains really well the data, creating a model selection uncertainty, which is usually ignored. BMA (Bayesian Model Averaging provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability.

As an example, consider that we have an evidence D and 3 possible hypothesis h1, h2 and h3. The posterior probabilities for those hypothesis are P( h1 | D ) = 0.4, P( h2 | D ) = 0.3 and P( h3 | D ) = 0.3. Giving a new observation, h1 classifies it as true and h2 and h3 classify it as false, then the result of the global classifier (BMA) would be calculated as follows:

Basic BMA Bibliography

[1] J. A.and Madigan D. Hoeting and A.E.and Volinsky C.T. Raftery. Bayesian model averaging: A tutorial (With Discussion). Statistical Science, 44(4):382--417, 1999. (Download)

Basic BMA Researchers

Composite Methods for Machine Learning


The Occam's razor is present everywhere, even in Machine Learning, where when some models fit the training data we usually select the simplest one as the model we are going to use. As said in [1] the Greek philosopher Epicurus defended a thesis far away from Occam's Razor, he defended that if two or more hypothesis fit the data, we should use all of them, not only one of them. Composite methods are machine learning techniques/algorithms where we use more than one model/hypothesis/classifier to obtain a better result in our machine learning task. Inside Composite Methods there are very different methods, resumed in the next more or less general techniques:

1.- Bayesian Model Averaging

2.- Ensemble Methods
2.1.- Bagging
2.2.- Boosting
2.3.- Other Fussion Methods

3.- Mixture of Experts
3.1.- Gating Networks

4.- Local Learning Algorithms
4.1.- RBF
4.2.- KNN

5.- Hybrid Methods
5.1.- Cascade
5.2.- Stacking

6.- Composite Methods for Scaling or Distributing ML algorithms

I'm going to spend some post describing, in a general way, all this techniques.

[1] "Introducción a la Minería de Datos" José Hernández Orallo, M.José Ramírez Quintana, Cèsar Ferri Ramírez, Pearson, 2004. ISBN: 84 205 4091 9