Social Media, Data Mining & Machine Learning: July 2008

Google Translation + Flickr API = FlickrBabel

Posted by JoSeK at 8:32 AM . Tuesday, July 22, 2008

I'm currently involved in the development of a new startup, Wipley.com, a social network for video gamers. As I've been "playing" with a lot of web application APIs, I've had some ideas about integrating some of them to create something useful.

The first application I've developed for Wipley is FlickrBabel, a simple application that improves photo search in Flickr by means of automated translation (Google translation API) and query expansion, so that the search (through the Flickr API) runs on a more general query. This method can be very useful for many people, especially non-English speakers, as Flickr (like many other web applications) is used more by English speakers than by Spanish speakers or, at least, has more photographs tagged and described in English than in Spanish.

As a simple practical example, if you search for "girasol" (the Spanish word for sunflower) in Flickr, you get a little over 6,200 results. If you search for "sunflower", you get more than 187,000 results. So, if you speak some English, you are better off using English instead of Spanish for your Flickr queries. However, there are many other cases where English queries do not work as well as in the previous example. For instance, if you search for "omelette", you'll get over 11,000 results, but the Spanish translation, "tortilla", returns almost 30,000 results. FlickrBabel helps by automatically translating our queries and running them in both languages (I'll extend the functionality to other languages very soon).
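The idea can be sketched in a few lines of Python. This is only an illustration, not FlickrBabel's actual code: `translate` and `search` are hypothetical callables standing in for the translation and photo-search API calls, and the tiny dictionary and photo index are made-up stubs so that the example runs on its own.

```python
from typing import Callable, Dict, List


def expand_query(query: str, translate: Callable[[str], str]) -> List[str]:
    """Return the original query plus its translation, without duplicates."""
    queries = [query]
    translated = translate(query)
    if translated and translated.lower() != query.lower():
        queries.append(translated)
    return queries


def search_expanded(
    query: str,
    translate: Callable[[str], str],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Run the search once per query variant and merge the results in order."""
    seen = set()
    merged = []
    for variant in expand_query(query, translate):
        for photo_id in search(variant):
            if photo_id not in seen:
                seen.add(photo_id)
                merged.append(photo_id)
    return merged


# Hypothetical stand-ins for the real translation and search services.
_STUB_DICTIONARY: Dict[str, str] = {"girasol": "sunflower"}
_STUB_INDEX: Dict[str, List[str]] = {
    "girasol": ["photo-1", "photo-2"],
    "sunflower": ["photo-2", "photo-3", "photo-4"],
}


def stub_translate(text: str) -> str:
    """Look the word up in the stub dictionary; unknown words pass through."""
    return _STUB_DICTIONARY.get(text, text)


def stub_search(text: str) -> List[str]:
    """Return the stub photo ids indexed under the given query."""
    return _STUB_INDEX.get(text, [])


if __name__ == "__main__":
    print(search_expanded("girasol", stub_translate, stub_search))
```

Passing the translator and the search function as parameters keeps the merging logic independent of any particular web API, so the same sketch would work for more than two languages by returning several translations.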

Now, I'm working on several ways to relate photographs to one another by means of contextual analysis. The application is at a beta stage, but I'd appreciate any feedback, either as a reply to this post or as a reply to the post we wrote in the official Wipley blog :D

Post-Summer Conferences

Posted by JoSeK at 7:10 PM . Saturday, July 19, 2008

0 comments

Labels: conference

After the summer, Franki and I will be attending CIKM 2008 (in Napa Valley, California) and IDEAL 2008 (in Daejeon, Korea), presenting different parts of the work we are doing in the SINAMED and ISIS projects.

Our ongoing research in these projects "is mainly focused on using biomedical concepts for cross-lingual text classification. In this context the use of concepts instead of bag of words representation allows us to face text classification tasks abstracting from the language". For cross-lingual text tasks, "we evaluate the possibility of combining automatic translation techniques with the use of biomedical ontologies to produce an English text that can be processed by MMTx", saving the effort of developing a Spanish MetaMap.

I hope to see some of you at CIKM or IDEAL :)

KDD 2009 in Paris

Posted by JoSeK at 12:36 PM . Wednesday, July 16, 2008

0 comments

Labels: data mining, events

KDD comes to Europe, which is great news for European data miners :) From June 28 to July 1, 2009, KDD will be held in Paris. No key dates have been announced for the conference yet, but I suppose the Call for Papers will come by the end of January 2009.

JMLR: Workshop and Conference Proceedings

Posted by JoSeK at 10:43 PM . Tuesday, July 15, 2008

20 comments

Labels: journal, machine learning, open access

The Journal of Machine Learning Research (JMLR) is one of the leading journals in Machine Learning. It is ranked 7th in the "Computer Science, Artificial Intelligence" category of the JCR, with an impact factor of 2.682.

Beyond the quality of the journal and the papers published there, JMLR has been a great initiative as the first quality Open Access journal in the Machine Learning field. Over the last two years, JMLR has kept innovating with new initiatives such as its support for the development of Open Source Machine Learning software or the recent creation of a special "Conference and Workshop Proceedings" series that aims to publish the work presented at Machine Learning workshops and conferences in an Open Access manner. The series has an ISSN (1938-7228) and is described by JMLR as follows:

"The JMLR: Workshop and Conference Proceedings series is a new series aimed specifically at publishing work presented at workshops and conferences. Each volume is separately titled and associated with a particular workshop or conference and will be published online on the JMLR web site. Authors will retain copyright and individual volume editors are free to make additional hardcopy publishing arrangements, but JMLR will not produce hardcopies of these volumes."

AUC as Performance Metric in ML

Posted by JoSeK at 3:48 PM . Friday, July 04, 2008

11 comments

Labels: machine learning, performance

ROC analysis is a classic methodology from signal detection theory used to depict the tradeoff between hit rates and false alarm rates of classifiers (Egan 1975, Swets 2000). ROC graphs have also been commonly used in medical diagnosis for visualizing and analyzing the behavior of diagnostic systems (Swets 1998). Spackman (Spackman 1989) was one of the first machine learning researchers to show interest in using ROC curves. Since then, the interest of the machine learning community in ROC analysis has increased, due in part to the realization that simple classification accuracy is often a poor metric for measuring performance (Provost 1997, Provost 1998).

The ROC curve compares the classifier's performance across the entire range of class distributions and error costs (Provost 1997, Provost 1998). A ROC curve is a two-dimensional representation of classifier performance, which is useful for showing some characteristics of a classifier but makes it difficult to compare one classifier against another. A common way to reduce ROC performance to a single scalar value, which is easier to handle, is to calculate the area under the ROC curve (AUC) (Fawcett 2005). As the ROC curve is drawn in the unit square, the AUC value will always be between 0.0 and 1.0, and the best classifiers are those with the highest AUC values. As random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC below 0.5: a classifier that did could simply have its predictions inverted to score above 0.5.
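As a rough illustration of how the curve and its area can be computed, here is a minimal Python sketch. It assumes labels coded as 1 (positive) and 0 (negative) and scores where higher means "more likely positive"; the function names `roc_points` and `auc_trapezoid` are invented for this example, and the area is obtained with the plain trapezoid rule.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return ROC points as (false positive rate, true positive rate) pairs.

    Examples with tied scores are processed together, so a group of ties
    contributes a single (possibly diagonal) segment to the curve.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every example that shares this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a ROC curve given as points sorted by increasing FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    toy_labels = [1, 0, 1, 0]
    print(auc_trapezoid(roc_points(toy_scores, toy_labels)))
    # The random-guessing diagonal from (0,0) to (1,1).
    print(auc_trapezoid([(0.0, 0.0), (1.0, 1.0)]))
```

Feeding the diagonal directly to `auc_trapezoid` reproduces the 0.5 baseline mentioned above, while a ranking that places every positive above every negative reaches 1.0.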

Fig. 1. Example ROC graphs, taken from (Fawcett 2005). Subfigure (a) shows the AUC of two different classifiers. Subfigure (b) compares the graph of a scoring classifier, B, with that of a discrete simplification of the same classifier, A.

Figure 1a shows two ROC curves representing two classifiers, A and B. Classifier B obtains a higher AUC than classifier A and is therefore expected to perform better. Figure 1b shows a comparison between a scoring classifier (B) and a binary version of the same classifier (A). Classifier A represents the performance of B when it is used with a fixed threshold. Though they represent almost the same classifier, A's performance as measured by AUC is inferior to B's. As we have seen, a full ROC curve cannot be generated from a discrete classifier, which results in a less accurate performance analysis. For this reason, in this paper we focus on scoring classifiers, although there have been some attempts to create scoring classifiers from discrete ones (Domingos 2000, Fawcett 2001).
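The gap between a scoring classifier and its thresholded version can be reproduced on a toy example. The sketch below is an illustration with made-up scores, not data from the figure. It assumes that the discrete classifier's ROC "curve" is the two-segment line (0,0), (FPR, TPR), (1,1), whose area works out to (1 + TPR - FPR) / 2, and it measures the scoring classifier by the fraction of positive/negative pairs ranked correctly. The function names are invented for this example.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def thresholded_auc(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """AUC of the discrete classifier that predicts positive when score >= threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tpr = tp / n_pos
    fpr = fp / n_neg
    # Area under the polyline (0,0) -> (fpr, tpr) -> (1,1).
    return (1.0 + tpr - fpr) / 2.0


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    toy_labels = [1, 1, 0, 1, 0, 0]
    print("scoring classifier:", pairwise_auc(toy_scores, toy_labels))
    print("fixed threshold 0.5:", thresholded_auc(toy_scores, toy_labels, 0.5))
```

In this toy data the thresholded version scores lower because the threshold discards the ordering information among the examples on each side of it. That is what happens here, not a guarantee for every possible ranking.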

Hand and Till (Hand 2001) present a simple approach to calculating the AUC of a given classifier.
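A minimal sketch of that rank-based estimate follows. It assumes the usual formulation AUC = (S0 - n0(n0 + 1)/2) / (n0 n1), where n0 and n1 are the numbers of positive and negative examples and S0 is the sum of the ranks of the positive examples when all examples are sorted by increasing score. The function name `auc_from_ranks` and the use of average ranks for tied scores are choices made for this illustration.

```python
from typing import Sequence


def auc_from_ranks(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC estimate; labels are 1 (positive) or 0 (negative)."""
    n0 = sum(1 for y in labels if y == 1)
    n1 = len(labels) - n0
    if n0 == 0 or n1 == 0:
        raise ValueError("need at least one positive and one negative example")

    # Indices of the examples sorted by increasing score.
    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of examples tied with position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    s0 = sum(rank for rank, y in zip(ranks, labels) if y == 1)
    return (s0 - n0 * (n0 + 1) / 2.0) / (n0 * n1)


if __name__ == "__main__":
    print(auc_from_ranks([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

The appeal of this form is that it needs only one sort of the scores and no explicit construction of the ROC curve.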

REFERENCES

(Domingos 2000) P. Domingos, F. Provost, Well-trained PETs: Improving Probability Estimation Trees, 2000.
(Egan 1975) J. P. Egan, Signal Detection Theory and ROC Analysis. Series in Cognition and Perception. Academic Press, 1975.
(Fawcett 2001) T. Fawcett. Using Rule Sets to Maximize ROC Performance. In IEEE International Conference on Data Mining, pp. 131-138, 2001.
(Fawcett 2005) T. Fawcett. An Introduction to ROC Analysis. Pattern Recognition Letters, 27:861-874, 2005.
(Hand 2001) D. J. Hand, R. J. Till, A Simple Generalization of the Area under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), pp. 171-186, 2001.
(Provost 1997) F. Provost, T. Fawcett, Analysis and Visualization of Classifier Performance. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pp. 43-48. AAAI Press, 1997.
(Provost 1998) F. Provost, T. Fawcett, R. Kohavi. The Case Against Accuracy Estimation for Comparing Induction Algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 445-453, 1998.
(Spackman 1989) K. A. Spackman. Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning. In Proceedings of the Sixth International Workshop on Machine Learning, pp. 160-163. 1989.
(Swets 1998) J. A. Swets, Measuring the Accuracy of Diagnostic Systems. Science, 240:1285-1293, 1988.
(Swets 2000) J. A. Swets, R. M. Dawes, J. Monahan, Better Decisions through Science. Scientific American, October 2000.
