Social Media, Data Mining & Machine Learning: June 2008

Performance Metrics

Posted by JoSeK at 7:44 PM . Monday, June 30, 2008

Performance metrics are values computed from a classifier's predictions that allow us to validate the classifier's model. Most of these metrics are defined in terms of a confusion matrix. Figure 1 shows a confusion matrix for a two-class problem, which serves as an example for describing the basic performance metrics. In the figure:

π₀ denotes the a priori probability of class (+).
π₁ denotes the a priori probability of class (-); π₁ = 1 - π₀.
p₀ denotes the proportion of times the classifier predicts class (+).
p₁ denotes the proportion of times the classifier predicts class (-); p₁ = 1 - p₀.
TP is the number of instances belonging to class (+) that the classifier has correctly classified as class (+).
TN is the number of instances belonging to class (-) that the classifier has correctly classified as class (-).
FP is the number of instances that, belonging to class (-), the classifier has classified as positive (+).
FN is the number of instances that, belonging to class (+), the classifier has classified as negative (-).

Fig. 1: Confusion matrix that generates the needed values for standard performance metrics
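
The quantities defined above can be tallied directly from paired true and predicted labels. What follows is a minimal Python sketch, assuming the labels are the strings "+" and "-"; the function name `confusion_counts` and the toy data are hypothetical and do not come from the original figure.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="+"):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted (+): correct if the instance really is (+)
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted (-): correct if the instance really is (-)
            counts["FN" if actual == positive else "TN"] += 1
    return counts["TP"], counts["TN"], counts["FP"], counts["FN"]


# Hypothetical toy data, for illustration only
y_true = ["+", "+", "+", "-", "-", "-", "-", "+"]
y_pred = ["+", "-", "+", "-", "+", "-", "-", "+"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
n = tp + tn + fp + fn
pi0 = (tp + fn) / n  # estimate of the a priori probability of class (+)
p0 = (tp + fp) / n   # proportion of times the classifier predicts (+)
```

Here π₀ and p₀ are estimated from the sample, so they are only as reliable as the sample is representative.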

Precision is the proportion of truly positive instances among all the instances the classifier labels as positive; precision = TP/(TP+FP). Recall is the proportion of truly positive instances that the classifier labels as positive; recall = TP/(TP+FN). Accuracy is the proportion of correctly classified instances; accuracy = (TP+TN)/(TP+TN+FP+FN). Other measures are used to estimate a classifier's performance when dealing with a large set of classes. One of them is F_β, which tries to compensate for a non-uniform distribution of instances among the classes. F_β is calculated as follows:

$$F_\beta = \frac{(1+\beta^2)\cdot \text{precision}\cdot \text{recall}}{\beta^2\cdot \text{precision} + \text{recall}}$$
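
A minimal sketch of these metrics as plain Python functions, assuming the usual definition of recall, TP/(TP+FN), accuracy taken over all instances, and the standard van Rijsbergen form of F_β. The function names and the example counts are illustrative only.

```python
def precision(tp, fp):
    """TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); defined as 0.0 when there are no positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Correctly classified instances over all instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_beta(tp, fp, fn, beta=1.0):
    """(1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favours recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Hypothetical counts, for illustration only
tp, tn, fp, fn = 40, 45, 5, 10
print("precision:", precision(tp, fp))
print("recall:   ", recall(tp, fn))
print("accuracy: ", accuracy(tp, tn, fp, fn))
print("F1:       ", f_beta(tp, fp, fn))
print("F2:       ", f_beta(tp, fp, fn, beta=2.0))
```

Returning 0.0 for undefined ratios is one common convention; raising an error instead would be equally defensible.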

Van Rijsbergen (van Rijsbergen, 1979) states that F_β measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision. The most typical choice is β = 1, which gives F₁, the harmonic mean of precision and recall. Traditionally, evaluation metrics such as recall, precision and F_β have been widely used by the Information Retrieval community, while classification accuracy has been the standard performance estimator in Machine Learning for years. More recently, the area under the ROC (Receiver Operating Characteristic) curve, or simply AUC, traditionally used in medical diagnosis, has been proposed as an alternative measure for evaluating the predictive ability of learning algorithms.
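
To make the AUC concrete: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair, counting ties as one half. It assumes the classifier outputs a numeric score per instance; the function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels, positive="+"):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    # A pair counts 1 if the positive is ranked higher, 0.5 on a tie
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores, for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = ["+", "+", "-", "+", "-", "-"]
print("AUC:", auc(scores, labels))
```

The pairwise loop is quadratic in the number of instances, which is fine for illustration; a rank-based computation would be preferable for large data sets.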

REFERENCES

(van Rijsbergen, 1979) C. J. van Rijsbergen, "Information Retrieval", Butterworths, 1979.
(Fawcett, 2005) T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters 27, pp. 861-874, 2005.
(Provost, 1997) F. Provost, T. Fawcett, "Analysis and Visualization of Classifier Performance", Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 43-48, 1997.

The Need for Open Source Software in Machine Learning

Posted by JoSeK at 7:47 PM . Wednesday, June 25, 2008

3 comments

Labels: machine learning, software

Reading the Undirect Grad blog, I found an interesting paper about the need for more Open Source software in Machine Learning. The abstract:

Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

I think this paper addresses a very interesting problem, and not only for the ML community. As the paper says, the "Open Source model allows better reproducibility of the results, quicker detection of errors, innovative applications, faster adoption of ML methods in other disciplines". It also avoids constantly reinventing the wheel, and it is a fairer model: if most research is funded by public money, why should researchers restrict access to the code?

The same applies to publications. Open Access should be a necessary condition for all publicly funded research. Luckily, there are several initiatives around the globe trying to spread the benefits of the Open Access model, such as Harvard's adoption of Open Access or the support of the Comunidad de Madrid (a Spanish region) for several Open Access initiatives (sorry for the link in Spanish).

In recent years, the ML community has improved in these respects. We can count on a very good Open Source ML framework in Weka, a top Open Access journal in JMLR, which also supports ML Open Source software, and a very good Open Source software repository in MLOSS.

Automated Microarray Classification Challenge

Posted by JoSeK at 12:29 PM . Tuesday, June 24, 2008

0 comments

Labels: cfps, events, machine learning, medical

The diagnosis of cancer on the basis of gene expression profiles is well established, so much so that micro-array classification has become one of the classic applications of machine learning in computational biology. The field has now reached the stage where a large scale evaluation exercise is warranted to determine the advantages and disadvantages of competing approaches. We have therefore organized a challenge for ICMLA'08, the aim of which is to determine the best fully automated approach to micro-array classification. An unusual feature of the competition is that instead of submitting predictions on test cases, the competitors submit a MATLAB implementation of their algorithm (R and Java interfaces are also in development), which is then tested off-line by the challenge organizers. This will test the true operational value of the method, in the hands of an end user who is not necessarily an expert in a given technique. The winner of the challenge will receive a free registration to ICMLA'08.

Further details and background information regarding the competition are available from the challenge website, http://theoval.cmp.uea.ac.uk/~gcc/projects/amcc. If you have any questions, please feel free to contact the challenge organizers (g...@cmp.uea.ac.uk).

The results of the challenge will be presented at a special session at ICMLA'08. Competitors are encouraged to participate in the special session and are invited to submit a technical paper describing their technique. Submissions should be made electronically in PDF format using the central ICMLA'08 website. The deadline for submissions is June 15, 2008. All accepted papers must be presented by one of the authors in order to be published in the conference proceedings.

Important Dates

Challenge opens March 10, 2008
Challenge closes July 15, 2008
Paper submission due July 15, 2008
Notification of acceptance September 1, 2008
Camera-ready papers & pre-registration October 1, 2008
ICMLA'08 conference December 11-13, 2008

Special Session Chair

Dr Wenjia Wang, University of East Anglia, Norwich, U.K.

Special Session Organizers

Dr Gavin Cawley, University of East Anglia, Norwich, U.K.
Dr Wenjia Wang, University of East Anglia, Norwich, U.K.
Mr Geoffrey Guile, University of East Anglia, Norwich, U.K.

KDD Cup 2008 and Workshop on Mining Medical Data

Posted by JoSeK at 12:21 PM .

0 comments

Labels: cfps, data mining, events, medical

KDD Cup is the first and the oldest data mining competition, and is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Based on data provided by Siemens Medical Solutions USA, this year's KDD Cup competition focuses on the early detection of breast cancer from X-ray images of the breast. We are looking forward to an interesting competition and your participation. We particularly encourage the participation of students.

There are 2 different parallel options for participating:

Submit entries to the KDD Cup competition
Paper submissions for the associated Workshop on Mining Medical Data

Further details on each option are provided below.

KDD Cup 2008

Siemens Medical Solutions is proud to provide the data for the KDD Cup 2008 competition. The competition focuses on the early detection of breast cancer from X-ray images of the breast. There are two specific tasks, selected to be interesting to participants from academia and industry. The tasks are described in detail at www.kddcup2008.com. You can choose to compete in either or both of the tasks. The training data can be downloaded after April 3, 2008. Important dates are listed below.

April 1 Web site up. Registration opens
April 3 Training data and evaluation code available after login
June 2 Test data available for download after login
June 20 Registration for KDD Cup closes
July 7 Last date for submission of results on test set
July 15 Notification of KDD Cup competition results
July 31 Winners submit their camera ready papers to the workshop
August 24-27 Winners present their work at the workshop.

Workshop on Mining Medical Data

We invite the submission of papers related to mining medical data. Participants in the KDD Cup 2008 may optionally submit papers to this workshop describing their entry. However, the workshop is broader in scope, and we also welcome other submissions related to the mining of medical data from structured sources such as structured databases and from unstructured data sources such as medical images, textual notes, etc. We particularly invite papers describing systems that are able to combine all available patient information whether from structured sources or from unstructured sources, to support medical decision making.

All submitted papers will be evaluated by the workshop program committee based on scientific merits and novelty as perceived by the committee. Accepted papers will appear in the workshop proceedings. Authors of the accepted papers are required to present their papers at the workshop. Depending on interest, a subset of the selected papers may also be published in a special issue of a journal later on. Important dates are listed below.

All submitted papers must be in PDF format, must be restricted to 4 pages, and must use the template found at http://www.acm.org/sigs/publications/proceedings-templates.

July 7 Last date for submitting papers for the workshop
July 28 Author Notification about Accepted papers
July 31 Final Camera ready papers due
August 24-27 Authors of accepted papers present their work.

Usama Fayyad quits Yahoo

Posted by JoSeK at 11:55 AM .

2 comments

Labels: news, people

Before joining Yahoo!, Dr. Usama Fayyad worked for 5 years at Microsoft Research and on building data mining solutions for Microsoft's servers division. From 1989 to 1996, Usama held a leadership role at NASA's Jet Propulsion Laboratory (JPL). In 2000, he co-founded and served as CEO of digiMine Inc. (now Revenue Science Inc.), a data analysis and data mining company.

Dr. Fayyad has been at Yahoo! for more than 4 years, serving as chief data officer and executive vice president of research and strategic data solutions. In that position, Fayyad has been responsible for Yahoo!'s overall data strategy, for architecting Yahoo!'s data policies and systems, and for managing Yahoo!'s data analytics and data processing infrastructure.

On June 12, New York Times Bits reported that

Mr. Fayyad told his staff yesterday that he would be leaving and his departure is expected to be officially announced later today. Mr. Fayyad was the data guru at Yahoo, the person in charge of mining the terabytes of data collected by the company to improve things like the targeting of ads and content to Yahoo users. He was also in charge of Yahoo’s well-respected research organization.

Gregory Piatetsky-Shapiro reported in KDnuggets some interesting words from Usama Fayyad, who says it is a good time to leave Yahoo! because his team will be able to continue his work. Usama seems to want to start a new company, taking advantage of his data mining knowledge and the broad vision of the Internet, search, advertising and the future of interactive media that Yahoo! has given him.

With this announcement, Usama joins many other Yahoo! execs who are currently trying to "run away" from Yahoo!.

Computational Linguistics (CL) goes Open Access

Posted by JoSeK at 9:23 AM . Thursday, June 19, 2008

0 comments

Labels: journal, open access

Hal announces that the CL journal will be Open Access starting with the first issue of next year. There will be no print version of the journal, and the electronic version will be Open Access.

The existence of an important Open Access journal for Computational Linguistics has been a topic of discussion in recent years. In May 2007, Hal published the post "Whence JCLR?", where he discussed JMLR, an Open Access Machine Learning journal that is one of the key journals for the ML community.

This is really good news for the CL community.

The Discipline of Machine Learning

Posted by JoSeK at 11:07 PM . Tuesday, June 17, 2008

0 comments

Labels: machine learning, people

Tom Mitchell is one of the key figures in the Machine Learning discipline. He has been working in this area since the late 1970s, has published some reference ML textbooks and, above all, is the head of the first Machine Learning department in the world.

In 2006, when he was "fighting" for the creation of the ML department at Carnegie Mellon University, he was told that "you can only have a department if you have a discipline that is going to be here in one hundred years otherwise you can not have a department". To argue that ML would last more than a hundred years, he wrote a white paper, "The Discipline of Machine Learning", which is a real must-read for everyone interested in ML. The abstract of the paper states:

Over the past 50 years the study of Machine Learning has grown from the efforts of a handful of computer engineers exploring whether computers could learn to play games, and a field of Statistics that largely ignored computational considerations, to a broad discipline that has produced fundamental statistical-computational theories of learning processes, has designed learning algorithms that are routinely used in commercial systems for speech recognition, computer vision, and a variety of other tasks, and has spun off an industry in data mining to discover hidden regularities in the growing volumes of online data. This document provides a brief and personal view of the discipline that has emerged as Machine Learning, the fundamental questions it addresses, its relationship to other sciences and society, and where it might be headed.

Tom also gave a speech related to this matter at the Carnegie Mellon University School of Computer Science's Machine Learning Department in March 2007. You can watch Mitchell's speech in this video.

ECML PKDD Discovery Challenge 2008

Posted by JoSeK at 7:32 PM . Monday, June 16, 2008

0 comments

Labels: challenge, conference

This year, the ECML/PKDD Discovery Challenge is about social bookmarking. There are two main tasks: Spam Detection in Social Bookmarking Systems and Tag Recommendation in Social Bookmark Systems. The challenge is organized in conjunction with the Web 2.0 Mining workshop and seems very interesting. The test data set will be released on July 30th, so there is enough time to try something :)

Interviews

Posted by JoSeK at 8:07 PM . Wednesday, June 11, 2008

0 comments

Labels: people

Some interesting interviews with important people from the DM & ML communities. Thanks to VideoLectures for hosting all this interesting material.

Dr. Usama Fayyad is responsible for Yahoo!'s overall data strategy, architecting Yahoo!'s data policies and systems, prioritizing data investments, and managing the Company's data analytics and data processing infrastructure.

Tom Mitchell is the first Chair of the first Machine Learning Department in the world, based at Carnegie Mellon.

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, knowledge discovery, bioinformatics, and business analytics.
