CFP: 25th Conference of the SEPLN (Spanish Society for Natural Language Processing)

Thursday, December 18, 2008

September 8-10, 2009
Palacio Miramar, Donostia - San Sebastián


The 25th edition of the Annual Conference of the Spanish Society for Natural Language Processing (SEPLN) will take place at the Miramar Palace in San Sebastian on September 8, 9 and 10, 2009.

We also expect to organise three satellite workshops during the week of the conference (see list of workshops).

The huge amount of information available in digital format and in different languages demands systems that let us access this vast library in an increasingly structured way.

In this same area, there is renewed interest in improving information accessibility and information exploitation in multilingual environments. Many of the formal foundations for dealing appropriately with these needs have been, and are still being, established in the area of Natural Language Processing and its many branches:

  • Information extraction and retrieval
  • Question answering systems
  • Machine translation
  • Automatic analysis of textual content
  • Text generation
  • Speech recognition and synthesis

The aim of the conference is to provide a forum for discussion and communication where the scientific and business communities can present the latest research work and developments in the field of Natural Language Processing (NLP). The conference also aims to showcase new possibilities for real applications and R&D projects in this field.

Moreover, as in previous editions, the conference intends to identify future directions for basic research and for planned software applications, in order to compare them against market needs. Finally, the conference intends to be an appropriate forum for helping new professionals become active members of this field.


Researchers and companies are encouraged to send communications, project abstracts or demonstrations related to any of the following language technology topics:
  • Linguistic, mathematical and psycholinguistic models of language
  • Corpus linguistics
  • Development of linguistic resources and tools
  • Grammars and formalisms for morphological and syntactic analysis
  • Semantics, pragmatics and discourse
  • Lexical ambiguity resolution
  • Machine Learning in NLP
  • Monolingual and multilingual text generation
  • Machine translation
  • Speech synthesis and recognition
  • Monolingual and multilingual information extraction and retrieval
  • Question answering systems
  • Automatic textual content analysis
  • Text summarization
  • NLP-based generation of teaching resources
  • NLP for languages with limited resources
  • NLP industrial applications


The conference will last three days and will consist of sessions devoted to presenting papers, posters, ongoing research projects, prototype demonstrations or products connected with the topics addressed in the conference. In addition, we expect to organize three satellite workshops during the week of the conference.


Proposals must be submitted no later than April 24, 2009, and they must meet certain format and style requirements.

Both the submission and the review of proposals will be done exclusively in PDF electronic format via the Myreview system. We recommend using the LaTeX and Word templates that can be downloaded from the conference webpage.

In addition, proposals will have to comply with the following requirements, depending on whether they are communications, demos or projects.


Authors are encouraged to send theoretical or system-related proposals.

The proposals must include the following sections:

  • A title of the communication.
  • The complete names of the authors, their affiliations, address, and e-mail (anonymous in the submitted proposal).
  • An abstract in English and Spanish (maximum 150 words), including a list of keywords or related topics.
  • The proposal can be written and presented in Spanish or English, and its overall maximum length will be 8 pages, excluding references, which may take up at most one additional page.
  • The documents must not include headers or footers.

The proposed papers will be assessed by at least three reviewers, and can be accepted for presentation either as posters or as communications, depending on the needs of the program. However, no distinction will be made between communications and posters in the printed version of the SEPLN magazine.


As in previous editions, the organizers encourage participants to give oral presentations of R&D projects and demos of systems or tools related to the NLP field. For oral presentations on R&D projects to be accepted, the following information must be included:

  • Project title
  • Name, affiliation, address, e-mail and phone number of the project director
  • Funding institutions
  • Groups participating in the project
  • Abstract (2 pages maximum)

For demonstrations to be accepted, the following information is mandatory:

  • Demo title
  • Name, affiliation, e-mail and phone number of the authors
  • Abstract (2 pages maximum)
  • Estimated time for the whole presentation


  • April 24, 2009: Deadline for submitting papers, projects and demos
  • May 25, 2009: Notification of acceptance
  • June 19, 2009: Deadline for submitting the final version
  • July 15, 2009: Deadline for early registration
  • Sept. 7, 2009: Workshops
  • Sept. 8, 9 & 10: 25th SEPLN Conference


Chairman: Kepa Sarasola (Euskal Herriko Unibertsitatea)


* Itziar Aduriz (Universitat de Barcelona)
* José Gabriel Amores (Universidad de Sevilla)
* Jose Maria Arriola (Euskal Herriko Unibertsitatea)
* Xabier Artola (Euskal Herriko Unibertsitatea)
* Toni Badía (Universitat Pompeu Fabra)
* Manuel de Buenaga (Universidad Europea de Madrid)
* Irene Castellón (Universitat de Barcelona)
* Arantza Díaz de Ilarraza (Euskal Herriko Unibertsitatea)
* Víctor Díaz Madrigal (Universidad de Sevilla)
* Antonio Ferrández (Universitat d'Alacant)
* Mikel Forcada (Universitat d'Alacant)
* Ana García-Serrano (Universidad Politécnica de Madrid)
* Alexander Gelbukh (Instituto Politécnico Nacional. México)
* Koldo Gojenola (Euskal Herriko Unibertsitatea)
* Xavier Gómez Guinovart (Universidade de Vigo)
* Julio Gonzalo (UNED)
* José Miguel Goñi (Universidad Politécnica de Madrid)
* José Carlos González (Universidad Politécnica de Madrid)
* Montserrat Marichalar (Euskal Herriko Unibertsitatea)
* José Mariño (Universitat Politècnica de Catalunya)
* M. Antonia Martí (Universitat de Barcelona)
* María Teresa Martín (Universidad de Jaén)
* Patricio Martínez (Universitat d'Alacant)
* Paloma Martínez (Universidad Carlos III, Madrid)
* Raquel Martínez (UNED)
* Ruslan Mitkov (Universidad de Wolverhampton)
* Manuel Montes y Gómez (Instituto Nacional de Astrofísica, Óptica y Electrónica. México)
* Lidia Moreno (Universitat Politècnica de València)
* Lluís Padró (Universitat Politècnica de Catalunya)
* Ramón López Cózar (Universidad de Granada)
* Manuel Palomar (Universitat d'Alacant)
* Ferrán Pla (Universitat Politècnica de València)
* German Rigau (Euskal Herriko Unibertsitatea)
* Horacio Rodríguez (Universitat Politècnica de Catalunya)
* Leonel Ruiz Miyares (Centro de Lingüística Aplicada de Santiago de
* Emilio Sanchís (Universitat Politècnica de València)
* Kepa Sarasola (Euskal Herriko Unibertsitatea)

* Mariona Taulé (Universitat de Barcelona)
* L. Alfonso Ureña (Universidad de Jaén)
* Felisa Verdejo (UNED)
* Manuel Vilares (Universidad de A Coruña)
* Luis Villaseñor-Pineda (Instituto Nacional de Astrofísica, Óptica y Electrónica. México)


All the information about the Conference is available on the 25th SEPLN Conference website.

CFP: 21st International Joint Conference on Artificial Intelligence (IJCAI-09)

Monday, December 08, 2008

The IJCAI-09 Program Committee invites submissions of technical papers for IJCAI-09, to be held in Pasadena, CA, USA, July 11-17, 2009. Submissions are invited on significant, original, and previously unpublished research on all aspects of artificial intelligence.

The theme of IJCAI-09 is "The Interdisciplinary Reach of Artificial Intelligence," with a focus on the broad impact of artificial intelligence on science, engineering, medicine, social sciences, arts and humanities. The conference will include invited talks, workshops, tutorials, and other events dedicated to this theme.

Important dates for authors of technical papers:
  • Electronic abstract submission: January 7, 2009 (11:59PM, PST)
  • Electronic paper submission: January 12, 2009 (11:59PM, PST)
  • Author feedback period: March 13-16, 2009 (11:59PM, PDT). Please note: Daylight saving time starts on March 8.
  • Author notification of acceptance/rejection: March 31, 2009
  • Camera-ready copy due: April 14, 2009
  • Technical sessions: July 13-17, 2009
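
Because the dates above mix PST and PDT, authors outside the Pacific time zone may find it useful to convert the cut-off times to UTC. The following is only a small illustrative sketch: the helper `to_utc` and the fixed offsets (UTC-8 for PST, UTC-7 for PDT) are assumptions made here for the example, not something supplied by the conference.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration: PST is UTC-8, PDT is UTC-7.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(year, month, day, hour, minute, tz):
    """Convert a local wall-clock deadline to UTC (hypothetical helper)."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


# Cut-off times as listed in the call (11:59PM local time).
abstract_deadline = to_utc(2009, 1, 7, 23, 59, PST)
paper_deadline = to_utc(2009, 1, 12, 23, 59, PST)
feedback_close = to_utc(2009, 3, 16, 23, 59, PDT)

for label, moment in [
    ("Abstract submission", abstract_deadline),
    ("Paper submission", paper_deadline),
    ("Author feedback closes", feedback_close),
]:
    print(f"{label}: {moment:%Y-%m-%d %H:%M} UTC")
```

Note that the two January deadlines fall on the following calendar day in UTC, and that the feedback period closes one hour earlier in UTC terms than a PST deadline would.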

Submission Details

Submitted papers must be formatted according to IJCAI guidelines and submitted electronically through the IJCAI-09 paper submission site. Full instructions for submission, including formatting guidelines and electronic templates for paper submission, are available on the IJCAI-09 website (see the link titled Submission Details). Submitting authors will be required to register with the IJCAI-09 paper submission software (this will be linked from the IJCAI-09 website during the first week of December, 2008).

Papers may be accepted for either oral or poster presentation; papers accepted for either form of presentation will not be distinguished in the conference proceedings, nor will the designation of oral or poster presentation be based on the quality of the contribution. Instead, these distinctions will be made in the interests of overall program coherence and quality.

To facilitate review, the paper title, author names, contact details, and a brief abstract must be submitted electronically by Jan. 7, 2009 (11:59PM, PST). No paper will be accepted for review unless an accompanying abstract is received by the deadline. Technical papers are due electronically on Jan. 12, 2009 (11:59PM, PST). Authors bear full responsibility for compliance with submission standards. Submissions received after the deadline or that do not meet the length or formatting requirements will not be accepted for review. No email or fax submissions will be accepted. Notification of receipt of the electronically submitted papers will be emailed to the designated contact author soon after receipt. If there are problems with the electronic submission, the program chair will contact the designated author by email. The last day for inquiries regarding lost submissions is Jan. 19, 2009. Notification of acceptance or rejection of submitted papers will be emailed to the designated author by March 31, 2009. The opportunity to respond to preliminary reviews will be made available to authors prior to this date, during the period March 13-16, 2009.

Guidelines for such responses, along with details of the reviewing process will be posted on the IJCAI-09 website. Camera-ready copy of accepted papers must be received by the publisher by April 14, 2009. Note: at least one author of each accepted paper is required to attend the conference to present the work. Authors will be required to confirm their acceptance of this requirement at the time of submission.

Authors who do not have access to the web should contact the program chair no later than December 15, 2008 for alternate submission instructions.

Content Areas

To facilitate the reviewing process, authors will be required to choose two to four appropriate content area keywords from the list provided by the IJCAI-09 submission software, which will be part of the online paper registration process. Authors are encouraged to select the most specific keywords that accurately describe the main aspects of their contributions. General categories should only be used if specific categories do not apply or do not accurately reflect the main contributions. Each keyword is placed within one of ten major themes; however, many of the keywords cut across multiple themes, and authors should feel free to select any keyword descriptive of the contribution, even if the major theme within which it is categorized is not the most appropriate. A list of keywords is appended to the end of this call.

The major themes are:

Agent-based and Multi-agent Systems
Constraints, Satisfiability, and Search
Knowledge Representation, Reasoning and Logic
Machine Learning
Multidisciplinary Topics And Applications
Natural Language Processing
Planning and Scheduling
Robotics and Vision
Uncertainty in AI
Web and Knowledge-based Information Systems

Policy on Multiple Submissions

IJCAI will not accept any paper which, at the time of submission, is under review for or has already been published or accepted for publication in a journal or another conference. Authors are also required not to submit their papers elsewhere during IJCAI's review period. These restrictions apply only to journals and conferences, not to workshops and similar specialized presentations with a limited audience and without archival proceedings. Authors will be required to confirm that their submissions conform to these requirements at the time of submission.

Paper Length and Format

Submitted technical papers must be no longer than six pages, including all figures and references, and must be formatted according to posted IJCAI-09 guidelines. Specifically, papers must be formatted for "letter-size" (8.5" x 11") paper, in double-column format with a 10pt font. Electronic templates for the LaTeX typesetting package, as well as a Word template, that conform to IJCAI-09 guidelines will be made available at the conference website (see above) during the first week of December, as will further details on formatting.

Authors are required to submit their electronic papers in PDF format. Files in Postscript (ps) or any other format will not be accepted.

Submitted papers must not exceed six (6) formatted pages, including references and figures. This six-page limit will be strictly enforced: over-length papers will not be considered for review. Each accepted paper will be allowed six pages in the proceedings; up to two additional pages may be purchased at a price of $275 per page. In order to make blind reviewing possible, authors must omit their names and affiliations from the paper. Also, while the references should include all published literature relevant to the paper, including previous works of the authors, they should not include unpublished works. When referring to one's own work, use the third person rather than the first person. For example, say "Previously, Foo and Bar [7] have shown that...", rather than "In our previous work [7] we have shown that..." For accepted papers, such identifying information can be added to the final camera-ready version for publication.

Review Process

Papers will be subject to blind peer review. Selection criteria include accuracy and originality of ideas, clarity and significance of results and quality of the presentation. Each paper will be assigned to three Program Committee members, one Senior Program Committee member and one Area Chair for review. The reviewing process will include a short period for the authors to view reviews and respond to technical questions on the submitted work raised by the reviewers before final decisions are made. The decision of the Program Committee will be final and cannot be appealed.

Papers accepted for the conference will be scheduled for oral or poster presentation and will be printed in the proceedings. At least one author of each accepted paper will be required to attend the conference to present the work.

Please send inquiries about paper submissions to

Inquiries about the conference program can be directed to:

Craig Boutilier
Program Chair, IJCAI-09
Department of Computer Science
University of Toronto
Toronto, ON, M5S 3H5, CANADA

For further information please visit the conference web site:

List of keywords:

Agent-based and Multi-agent Systems
  • Agent/AI Theories and Architectures
  • Agent-based Simulation and Emergent Behavior
  • Agent Communication
  • Argumentation
  • Auctions And Market-Based Systems
  • Coordination And Collaboration
  • Distributed AI
  • E-Commerce
  • Game Theory
  • Information/Mobile/Software Agents
  • Multiagent Learning
  • Multiagent Planning
  • Multiagent Systems (General/other)
  • Negotiation And Contract-Based Systems
  • Social Choice Theory

Constraints, Satisfiability, and Search
  • Applications
  • Constraint Optimization
  • Constraint Satisfaction (General/other)
  • Distributed Search/CSP/Optimization
  • Dynamic Programming
  • Search, SAT, CSP: Evaluation and Analysis
  • Global Constraints
  • Heuristic Search
  • Search, SAT, CSP: Meta-heuristics
  • Meta-Reasoning
  • Quantifier Formulations
  • Satisfiability (General/other)
  • SAT and CSP: Modeling/Formulations
  • Search (General/other)
  • SAT and CSP: Solvers and Tools

Knowledge Representation, Reasoning and Logic
  • Action, Change and Causality
  • Automated Reasoning and Theorem Proving
  • Belief Change
  • Common-Sense Reasoning
  • Computational Complexity of Reasoning
  • Description Logics and Ontologies
  • Diagnosis and Abductive Reasoning
  • Geometric, Spatial, and Temporal Reasoning
  • Knowledge Representation Languages
  • Knowledge Representation (General/other)
  • Logic Programming
  • Many-Valued And Fuzzy Logics
  • Nonmonotonic Reasoning
  • Preferences
  • Qualitative Reasoning
  • Reasoning with Beliefs

Machine Learning
  • Active Learning
  • Case-based Reasoning
  • Classification
  • Cost-Sensitive Learning
  • Data Mining
  • Ensemble Methods
  • Evolutionary Computation
  • Feature Selection/Construction
  • Kernel Methods
  • Learning Graphical Models
  • Learning Preferences/Rankings
  • Learning Theory
  • Machine Learning (General/other)
  • Neural Networks
  • Online Learning
  • Reinforcement Learning
  • Relational Learning
  • Time-series/Data Streams
  • Transfer, Adaptation, Multi-task Learning
  • Semi-Supervised/Unsupervised Learning
  • Structured Learning

Multidisciplinary Topics And Applications
  • AI and Natural Sciences
  • AI and Social Sciences
  • Art And Music
  • Autonomic Computing
  • Cognitive Modeling
  • Computational Biology
  • Computer Games
  • Computer-Aided Education
  • Database Systems
  • Philosophical and Ethical Issues
  • Human-Computer Interaction
  • Intelligent User Interfaces
  • Interactive Entertainment
  • Personalization and User Modeling
  • Real-Time Systems
  • Security and Privacy
  • Validation and Verification

Natural Language Processing
  • Dialogue
  • Discourse
  • Information Extraction
  • Information Retrieval
  • Machine Translation
  • Morphology and Phonology
  • Natural Language Generation
  • Natural Language Semantics
  • Natural Language Summarization
  • Natural Language Syntax
  • Natural Language Processing (General/other)
  • Psycholinguistics
  • Question Answering
  • Speech Recognition And Understanding
  • Text Classification

Planning and Scheduling
  • Activity and Plan Recognition
  • Hybrid Systems
  • Markov Decision Processes
  • Model-Based Reasoning
  • POMDPs
  • Plan Execution And Monitoring
  • Plan/Workflow Analysis
  • Planning Algorithms
  • Planning under Uncertainty
  • Planning (General/other)
  • Scheduling
  • Theoretical Foundations of Planning

Robotics and Vision
  • Behavior And Control
  • Cognitive Robotics
  • Human Robot Interaction
  • Localization, Mapping, State Estimation
  • Manipulation
  • Motion and Path Planning
  • Multi-Robot Systems
  • Robotics
  • Sensor Networks
  • Vision and Perception

Uncertainty in AI
  • Approximate Probabilistic Inference
  • Bayesian Networks
  • Decision/Utility Theory
  • Exact Probabilistic Inference
  • Graphical Models
  • Preference Elicitation
  • Sequential Decision Making
  • Uncertainty Representations
  • Uncertainty in AI (General/other)

Web and Knowledge-based Information Systems
  • Information Extraction
  • Information Integration
  • Information Retrieval
  • Knowledge Acquisition
  • Knowledge Engineering
  • Knowledge-based Systems (General/other)
  • Ontologies
  • Recommender Systems
  • Semantic Web
  • Social Networks
  • Source Wrapping
  • Web Mining
  • Web Search
  • Web Technologies (General/other)

Call for ICML/UAI/COLT 2009 Workshop Proposals

Monday, December 01, 2008

The ICML, UAI, and COLT conferences will be colocated in Montreal, June 14-21, 2009. We solicit proposals for workshops to be held during a single joint workshop day on June 18. This date lies between ICML (June 14-17) and UAI/COLT (June 19-21). Workshops will be selected on the basis of their interest to the attendees of one or more of the conferences.

The goal of the workshops is to provide an informal forum for researchers to discuss important research questions and challenges. Controversial issues, open problems, and comparisons of competing approaches are encouraged. Representation of alternative viewpoints and panel-style discussions are also encouraged.


The format, style, and content of accepted workshops are under the control of the workshop organizers and largely autonomous from the main conferences. The workshops will be seven hours long and split into morning and afternoon sessions. Workshop organizers will be expected to manage the workshop content, specify the workshop format, be present to moderate the discussion and panels, invite experts in the domain, and maintain a website for the workshop. Workshop registration will be handled centrally by the main conferences with a single uniform registration fee, and registrants will be allowed to attend workshops other than the one they register for.

Submission Instructions

Proposals should specify clearly all of the following:

  • the workshop's title (what is it called?)
  • topic (what is it about?)
  • motivation (why a workshop on this topic?)
  • impact and expected outcomes (what will having the workshop do?)
  • potential invited speakers (who might come?)
  • a list of related publications (where can we learn more?)
  • main workshop organizer (who is making it happen?)
  • other organizers (who else is making it happen?)
  • workshop URL (where will interested parties get more information?)
  • relevant conferences (which of ICML, UAI, and COLT would it appeal to?)

Please also provide brief CVs of all organizers. This information should be sent by email (in plain text or pdf format) by 19 Jan 2009.

16 PhD Scholarships in bioinformatics and robotics


I just published in "Computer Science PhD" 16 PhD Scholarships for working on 16 individual projects in the fields of bioinformatics and robotics at the Graduate School for Computing in Medicine and Life Sciences (Germany). PhD scholarships amount to 1250 € per month. Students with a master's degree (or its equivalent) in computer science, mathematics, or engineering are invited to apply for admission. The application deadline is January 15, 2009.

ICML 2009 Call for Papers

Thursday, November 27, 2008

The 26th International Conference On Machine Learning (ICML-2009)
June 14-18, 2009, Montreal, Canada

This call for papers extends the preliminary call by including the conference website and the list of area chairs and topic descriptors. Please browse the list of area chairs to get a sense of the scope and coverage of this year's conference. We encourage a broad range of submissions!

ICML 2009 invites submission of engagingly written papers on substantial, original, and previously unpublished research in *all* aspects of machine learning. We welcome submissions of innovative work on systems that are self-adaptive, systems that improve their own performance, or systems that apply logical, statistical, probabilistic or other formalisms to the analysis of data, to the learning of predictive models, or to interaction with the environment. We welcome innovative applications, theoretical contributions, carefully evaluated empirical studies, and we particularly welcome work that combines all of these elements. We also encourage submissions that bridge the gap between machine learning and other fields of research. ICML 2009 will be held in Montreal, Canada, June 14-18, 2009, and will be co-located with the Uncertainty in Artificial Intelligence Conference (UAI), the Conference on Learning Theory (COLT), and the Multidisciplinary Symposium on Reinforcement Learning (MSRL).

DATES (Note slightly earlier schedule than 2008):

  • January 26: Full paper submissions due (no separate abstract date)
  • February 27: First round reviews available
  • March 10: Author responses due
  • April 6: Acceptance notification
  • April 20: Final camera-ready version due
  • June 14: ICML Tutorials
  • June 15-17: ICML Conference
  • June 18: Joint Workshops Day, ICML/UAI/COLT; MSRL

Format of the Conference

The conference will include three days of technical presentations, one day of tutorials and one day of workshops. Accepted papers will each have an oral presentation as well as a poster in an evening poster session. There will also be talks by several invited speakers and a banquet.


Awards will be given for Best Paper(s), Best Student Paper(s) (first-authored by a student), Best Application Paper, and 10-year Best Paper (most influential paper of ICML 1999).


Submission format, details and style files will soon be available on the ICML 2009 website. Submission of papers and the management of the paper reviewing process will be entirely electronic.

Review Process (New for 2009!)

Our review process this year will be slightly different from previous years to further encourage innovative papers on a variety of topics. Authors will indicate a preference for an area chair to handle their papers via an inverse bidding process. It is crucial for authors to familiarize themselves with the 2009 area chairs and their topic descriptors. The goal is to ensure each submission is considered by reviewers appropriate to the paper's intended contribution. Each submitted paper will receive two first round reviews. As in recent years, authors will have the opportunity to see and respond to the reviews before a final decision is made. Papers that receive at least one positive review in the first round will receive one or more additional reviews. Final decisions will be made using the input from all reviewers, the author feedback, the assigned area chair, and programme co-chairs. Reviewing for ICML 2009 will be blind to the identities of the authors. No conditional accepts will be granted this year.

ICML 2009 will not accept any paper that is substantially similar to another paper that is currently under review or has already been accepted for publication in a journal or another conference. The programme co-chairs will consider making an exception for papers published in substantially disjoint communities (application conferences, for example), as long as the submitted papers are themselves clearly targeted to a machine-learning audience. Please clearly indicate which contributions are novel and which are previous work, either by the authors or others. If a paper submitted to ICML 2009 and another already published or already submitted paper contain substantial overlap in content and the content is not clearly indicated (anonymously) as being previous work, then the ICML submission may be rejected on the grounds of being a dual submission.

Similarly, authors must withdraw their papers if they submit an overlapping paper elsewhere during ICML's review period.

With your help, we expect another excellent conference!

-The ICML2009 Organizational Team

General Chair:
Andrea Danyluk (Williams College)
Programme co-chairs:
Leon Bottou (NEC Research)
Michael Littman (Rutgers University)
Local Arrangements Chair:
Doina Precup (McGill University)

On conferences

Wednesday, November 12, 2008

Just a little reflection: why do CS researchers spend so much money attending conferences? We attended CIKM and IDEAL recently and, taking into account the cost of travel and the limited audience, I can't understand why conferences are so relevant in our area. Too much money is spent for almost nothing, when we could organize ourselves around journals or even virtual conferences that do not require physical attendance.

In particular, at IDEAL, of the 4 presentations expected in our session, only 2 were actually given (in our case Frankie presented the paper), and there were only 6 people in the room. Traveling to Korea for that seems pretty ridiculous.

Frankie@Ideal 2008

What do you think about this particular topic?

Our presentation at IDEAL 2008

Monday, November 03, 2008

Tomorrow we will be presenting our paper at IDEAL 2008. We finished the presentation the other day and I've uploaded it to SlideShare in order to share it with you.

Going to CIKM and IDEAL

Saturday, October 25, 2008

Tomorrow I'll be flying (with my colleague Francisco Carrero) to San Francisco to attend CIKM 2008. Then, at the end of next week, I'll fly to South Korea to attend IDEAL 2008. 12 days out of the office to attend some interesting sessions and visit interesting places. A complete trip around the world :D

In the Development of a Spanish MetaMap

Tuesday, October 21, 2008

Frankie and I will be attending CIKM this year in order to present a poster on one of our current research lines. The poster is entitled "In the Development of a Spanish MetaMap" and presents how we are trying to deal with the adaptation of such a huge linguistic resource as MetaMap.

The poster is next. Please feel free to comment on it, as we would prefer to correct or rearrange anything before CIKM in order to make it clearer.

Poster@CIKM 2008


Monday, October 20, 2008

This morning, I found an interesting post about Yahoo! Open Strategy (Y!OS) in Enrique Puertas' blog (in Spanish). Y!OS is the way Yahoo! is trying to fight against its major competitor, Google. Both companies know about the relevance of Open Software and are adopting Open Strategies in order to create communities around their products.

Y!OS is organized in 3 main platforms plus OAuth as a way to implement a model of authentication.

1.- Yahoo! Application Platform: the Yahoo! platform for developing web applications that are available throughout Yahoo!. It gives developers a development environment, APIs to access important functionality, distribution and discovery infrastructure, and a runtime and rendering environment.

In the next video, Xavier Legros talks about Yahoo! Application Platform at Open Hack Day 2008.

2.- Yahoo! Social Platform: a suite of REST APIs that enable the creation of social applications that make it easier to connect users.

3.- Yahoo! Query Language: an SQL-like language that allows developers to query, filter and combine data across Yahoo!, as well as from any other sources such as RSS feeds or HTML webpages.

It seems really interesting that, from several services like BOSS or PIPES, Yahoo! has been able to develop a complete and unified platform that gives the developer added value. I like Yahoo!'s latest moves, and I think it depends only on Yahoo! itself whether it continues to be, or becomes again, a great player in Internet technologies.

The State of Business Intelligence 2008

Sunday, October 19, 2008

InformationWeek has published an interesting article about the state of Business Intelligence in 2008. This last year has been a very eventful one in this area, with big acquisitions (Cognos, BO, Hyperion, etc.) by big companies (Oracle, SAP, Microsoft, etc.) that change the current landscape of the sector.

Second Web People Search Evaluation Workshop

Tuesday, October 07, 2008

Second Web People Search Evaluation Workshop
Call for Participation

Finding information about people in the World Wide Web is one of the most common activities of Internet users. Person names, however, are highly ambiguous. In most cases, the results for a person name search are a mix of pages about different people sharing the same name. The user is then forced either to add terms to the query (probably losing recall and focusing on one single aspect of the person), or to browse every document in order to filter the information about the person he/she is actually looking for. In an ideal system the user would simply type a person name, and receive search results clustered according to the different people sharing that name.

In 2007 the Web People Search Task (Artiles et al. 2007) was the first competitive evaluation focused on this problem. The 16 participating systems received a set of web pages for a person name, and they had to cluster them into different entities. This second evaluation provides a new testbed corpus, improved evaluation metrics, and an additional attribute extraction subtask.

* Task definitions

** Clustering

In this task systems receive as input a set of web search results obtained when performing a query for an (ambiguous) person name. The expected output is a clustering of the web pages, where each cluster is assumed to contain all (and only those) pages that refer to the same individual.
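As a rough illustration of the expected output and of how a clustering could be scored against a gold standard, here is a minimal sketch. Purity and inverse purity are used only as an example; the call does not say which metrics this edition uses. The page identifiers and cluster labels are invented.

```python
from collections import Counter, defaultdict


def purity(system: dict, gold: dict) -> float:
    """Fraction of pages that carry the majority gold label of their system cluster."""
    clusters = defaultdict(list)
    for page, cluster_id in system.items():
        clusters[cluster_id].append(gold[page])
    matched = sum(Counter(labels).most_common(1)[0][1] for labels in clusters.values())
    return matched / len(system)


# Invented example: five pages returned for one ambiguous person name.
gold = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}   # true individuals
system = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 2}           # system clusters

system_purity = purity(system, gold)
# Inverse purity swaps the roles of the two partitions.
inverse_purity = purity(gold, system)
print(system_purity, inverse_purity)
```

Purity rewards clusters that contain a single individual, while inverse purity rewards keeping all pages of an individual together, so the two are usually reported as a pair.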

** Attribute Extraction

This subtask consists of extracting 18 kinds of "attribute values" for target individuals whose names appear on each of the provided Web pages. The organizers will distribute the target Web pages in their original format (i.e., html), and the participant systems have to extract attribute values from each page.

** Complete guidelines and data

* Participation

The clustering and the attribute extraction tasks will be regarded as two separate subtasks, and therefore a team can choose to participate in only one or in both of them. The organizers will provide annotated data for developing/training systems. In a second stage, an unannotated corpus will be distributed, system output will be collected and evaluation results will be returned to the participants. Each team can submit up to five runs. Every team is expected to write a paper describing their system and discussing the evaluation results.

* How do I register?

Please send an email expressing your interest to the task organizers (

* Important Dates

  • October 2008: Distribute the training data + CFP
  • December 1-8, 2008: Evaluation
  • December 17, 2008: Return the evaluation result
  • February 2009: Papers due.
  • April 2x, 2009: Workshop in Madrid.

* Workshop Organizers

  • Satoshi Sekine, Proteus Project (NYU).
  • Javier Artiles, NLP & IR Group (UNED).
  • Julio Gonzalo, NLP & IR Group (UNED).

* Program Committee

  • Eneko Agirre, UBC
  • Breck Baldwin, Alias-i
  • Andrew Borthwick, Spock
  • Jeremy Ellman, Northumbria University
  • Donna Harman, National Institute of Standards and Technology (NIST)
  • Eduard Hovy, ISI
  • Dmitri Kalashnikov, University of California, Irvine
  • Paul Kalmar, Fair Isaac
  • Bernardo Magnini, FBK-irst, Italy
  • Gideon Mann, Google
  • Yutaka Matsuo, Tokyo University
  • Manabu Okumura, Tokyo Inst. of Tech.
  • Ted Pedersen, University of Minnesota
  • Massimo Poesio, University of Essex
  • Maarten de Rijke, University of Amsterdam
  • Mark Sanderson, University of Sheffield
  • Arjen P. de Vries, Centrum Wiskunde & Informatica
Updated information about the task can be found at the WePS web site (

Asimov in Modern Science

. Monday, September 22, 2008

Today I read about the PHRIENDS project. Funded by the EU (€2.16 million), PHRIENDS tries to make robots respect Asimov's laws. Asimov must have been a great scientist, since even his most futuristic ideas are influencing current science.

I spent a great part of my adolescence reading Asimov's books and dreaming about intelligent robots that make use of the 3 laws. Now it seems that, someday, I could even work with a robot that implements those laws, and that is so cool... :D

P.S.: For Spanish speakers, I've written a longer post about the project on my Spanish science blog.

CFPS on Social Networks

. Thursday, September 11, 2008

In recent months, I've been developing software for my new company, Wipley, a social network for videogamers. That has awakened my interest in everything related to social network analysis. In the last few days, I've received a couple of calls for papers related to this topic that seem very interesting:

Workshop on Machine Learning Open Source 2008

. Wednesday, September 10, 2008

I like Open Access and Open Software; in fact, I'm a member of the local LUG at my university (GLUEM) and some of my posts deal with these topics. On the ML-News list, I've seen a call for submissions for a Workshop on Machine Learning Open Source Software (MLOSS), which will be held at NIPS on December 12th. I think this kind of workshop is a very good way to promote the use of Open Software in ML and to give extra recognition to those developers who let the community use their software, allowing other researchers to develop their experiments faster.

The NIPS Workshop on Machine Learning Open Source Software (MLOSS) will be held in Whistler (B.C.) on the 12th of December, 2008.

Important Dates

* Submission Date: October 1st, 2008
* Notification of Acceptance: October 14th, 2008
* Workshop date: December 12 or 13th, 2008

Call for Contributions

The organizing committee is currently seeking abstracts for talks at MLOSS 2008. MLOSS is a great opportunity for you to tell the community about your use, development, or philosophy of open source software in machine learning. This includes (but is not limited to) numeric packages (e.g. R, Octave, NumPy), machine learning toolboxes and implementations of ML algorithms. The committee will select several submitted abstracts for 20-minute talks. The submission process is very simple:

* Tag your project with the tag nips2008

* Ensure that you have a good description (limited to 500 words)

* Any bells and whistles can be put on your own project page, and of course provide this link on

On 1 October 2008, we will collect all projects tagged with nips2008 for review.

Note: Projects must adhere to a recognized Open Source License (cf. ) and the source code must have been released at the time of submission. Submissions will be reviewed based on the status of the project at the time of the submission deadline.


We believe that the wide-spread adoption of open source software policies will have a tremendous impact on the field of machine learning. The goal of this workshop is to further support the current developments in this area and give new impulses to it. Following the success of the inaugural NIPS-MLOSS workshop held at NIPS 2006, the Journal of Machine Learning Research (JMLR) has started a new track for machine learning open source software initiated by the workshop's organizers. Many prominent machine learning researchers have co-authored a position paper advocating the need for open source software in machine learning. Furthermore, the workshop's organizers have set up a community website where people can register their software projects, rate existing projects and initiate discussions about projects and related topics. This website currently lists 123 such projects including many prominent projects in the area of machine learning.

The main goal of this workshop is to bring the main practitioners in the area of machine learning open source software together in order to initiate processes which will help to further improve the development of this area. In particular, we have to move beyond a mere collection of more or less unrelated software projects and provide a common foundation to stimulate cooperation and interoperability between different projects. An important step in this direction will be a common data exchange format such that different methods can exchange their results more easily.

This year's workshop sessions will consist of three parts.

* We have two invited speakers: John Eaton, the lead developer of Octave and John Hunter, the lead developer of matplotlib.

* Researchers are invited to submit their open source project to present it at the workshop.

* In discussion sessions, important questions regarding the future development of this area will be discussed. In particular, we will discuss what makes a good machine learning software project and how to improve interoperability between programs. In addition, the question of how to deal with data sets and reproducibility will also be addressed.

Taking advantage of the large number of key research groups which attend NIPS, decisions and agreements taken at the workshop will have the potential to significantly impact the future of machine learning software.

Invited Speakers

* John D. Hunter - Main author of matplotlib.

* John W. Eaton - Main author of Octave.

Tentative Program

The 1 day workshop will be a mixture of talks (including a mandatory demo of the software) and panel/open/hands-on discussions.

Morning session: 7:30am - 10:30am

* Introduction and overview
* Octave (John W. Eaton)
* Contributed Talks
* Discussion: What is a good mloss project?
  o Review criteria for JMLR mloss
  o Interoperable software
  o Test suites

Afternoon session: 3:30pm - 6:30pm

* Matplotlib (John D. Hunter)
* Contributed Talks
* Discussion: Reproducible research
  o Data exchange standards
  o Shall datasets be open too? How to provide access to data sets.
  o Reproducible research, the next level after UCI datasets.

Program Committee

* Jason Weston (NEC Princeton, USA)
* Gunnar Rätsch (FML Tuebingen, Germany)
* Lieven Vandenberghe (University of California LA, USA)
* Joachim Dahl (Aalborg University, Denmark)
* Torsten Hothorn (Ludwig Maximilians University, Munich, Germany)
* Asa Ben-Hur (Colorado State University, USA)
* William Stafford Noble (Department of Genome Sciences Seattle, USA)
* Klaus-Robert Mueller (Fraunhofer Institute First, Germany)
* Geoff Holmes (University of Waikato, New Zealand)
* Alain Rakotomamonjy (University of Rouen, France)


Organizers

* Soeren Sonnenburg
Fraunhofer FIRST Kekuléstr. 7, 12489 Berlin, Germany

* Mikio Braun
Technische Universität Berlin, Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany

* Cheng Soon Ong
ETH Zürich, Universitätstr. 6, 8092 Zürich, Switzerland


The workshop is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning)

Dashboards, pointing the way of Business Intelligence

. Monday, September 08, 2008

I regularly read an interesting Spanish blog about the BI world, TodoBI, which pointed me to an article by Tom Gonzalez about the importance of dashboards in Business Intelligence. In this article, Tom lays out his vision of the future of Business Intelligence. He believes that BI should focus on dashboards, adopting a user-centric approach instead of a more data-centric one. In Tom's words:

So where does that leave us today, and what does this all mean for the future of BI? I think dashboards represent just the first step for the next major phase in BI both from a technology and a methodology perspective. For lack of a better term I will label this next phase the "BI user experience" as represented by user interfaces that information workers and business executives interact with to "experience" their data [...] Your ability to process that information and the inherent relationships within that data is exponentially higher and faster with the bar chart. This is one area where the human brain still far exceeds the power of technology-driven computation in its ability to recognize and process patterns composed of large volumes of information.

I totally agree with Tom's vision, which fits my own view of the connection between Machine Learning, Data Mining and Business Intelligence. For me, ML, DM and BI can be seen as three different areas, but also as a chain in which each one plays an important role. DM is data-centric as it focuses on data, BI is user-centric as it should deal with users' needs, and ML is the intelligence behind the process (although not every need requires an intelligent process).

In the figure, ML is represented inside DM, and DM inside BI. From the BI point of view, DM is like the cherry on the cake, a step beyond the statistical processes behind BI. ML sits inside DM as it is the engine for processing all the data in DM processes.

WWW Tracks

. Monday, September 01, 2008

This year, the WWW Conference is held 10 minutes from my workplace, at Universidad Europea de Madrid. There are several interesting tracks dealing with different aspects of the Web. The most interesting ones for me are "Data Mining", "Social Networks and Web 2.0" and "Semantic/Data Web".

I knew about the CFP, but the last time I visited the website there was no information about the tracks. A post on Hurst's blog reminded me to check the WWW Conference information again.

Nokia Workshop on Machine Consciousness

. Thursday, August 28, 2008

A good friend of mine, Raúl Arrabales, is attending the "Nokia Workshop on Machine Consciousness 2008", held in conjunction with the Finnish AI Conference at the Nokia Research Center in Helsinki. He has posted a personal summary of the workshop on his blog about conscious robotics, which I think is worth reading.

Data Mining Competition: Discovering Knowledge in NHANES Data

. Friday, August 22, 2008

The Knowledge Discovery and Data Mining Working Group of the American Medical Informatics Association (AMIA) is announcing its second annual data mining competition for the purpose of studying best practices related to knowledge discovery in health care data. This year’s data set is the National Health and Nutrition Examination Survey (NHANES). This is a juried, international data mining competition open to students of any subject or discipline. Four winning individuals or teams will be invited to present their results at the AMIA 2008 Annual Symposium. (No funds for travel or Symposium registration will be provided). Final submissions for the competition are due to the moderator by midnight, September 15, 2008, MDT. Winners of the contest will be invited to present their work at the AMIA Annual Symposium in a panel sponsored by the Knowledge Discovery and Data Mining Working Group. Winners will be selected and recognized by an international panel of judges associated with the KDDM-WG. Please visit for more details.


This contest is open to student members of the American Medical Informatics Association, planning to attend the AMIA 2008 Annual Symposium. However, students may hail from any subject or discipline. For information on joining AMIA, please visit: . The work can be completed by an individual or group, but only one individual will present at the symposium for a winning team.

Participation in the Contest

As a contestant, you are invited to produce meaningful information/ knowledge with knowledge discovery or data mining approaches of your choice, using the publicly available National Health and Nutrition Examination Survey (NHANES) data. The data itself is publicly accessible, and available here: The latest data release (as of this announcement) is 2005-2006, but there is no restriction on the specific data release to be used, and entries may utilize multiple data releases. A wide variety of data mining approaches are acceptable, including both supervised and unsupervised learning and the extraction of temporal association and precedence rules. Both applied and methodological entries are appropriate. Final submissions for the competition are due to the moderator by midnight, September 15, 2008, MDT (U.S. Mountain Daylight Time). Winners will be selected and recognized by an international panel of judges associated with the KDDM-WG. Winners of the contest will be invited to present their work at the AMIA Annual Symposium in a dedicated session. (No funds for travel or registration will be provided).

Contest Entry

Entries should consist of the following:
  1. A written report (paper), not to exceed a maximum of five (8.5 x 11 inch) pages, including:
    • An abstract of 125 - 150 words
    • Names, academic degree(s), affiliations, and locations (city, state, and country, if international) of all authors (advisers should be added as authors).
    • Content could include sections of introduction, methods, results, discussion, and conclusion, but is left to author discretion.
    • See the Submission Template (MS Word) for correct format. The (pdf) may be helpful.
  2. Name and address of student's training education program
  3. Advisor's name and contact information
  4. A joint statement, signed by the student and the advisor, that identifies the student's specific contribution to the work presented, and attests that the student prepared the paper.

Entries must be submitted as .doc or .pdf files. Submit entry to the contest moderator via e-mail: The deadline is midnight, September 15, 2008, MDT.

Useful Links

PAKDD 2009

. Friday, August 08, 2008

The 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009) will be held in Bangkok, Thailand, on 27-30 April 2009.

The 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-09) is a major international conference in the areas of data mining and knowledge discovery. It provides an international forum for researchers and industry practitioners to share their new ideas, original research results and practical development experiences from all KDD-related areas including data mining, data warehousing, machine learning, databases, statistics, knowledge acquisition and automatic scientific discovery, data visualization, causal induction and knowledge-based systems. The conference website is at

Important Dates:
  • 09 September 2008, Abstract Submission (CFP)
  • 16 September 2008, Paper Submission
  • 19 September 2008, Workshop Proposal (CF Workshop Proposal)
  • 26 September 2008, Workshop Notification
  • 17 November 2008, Tutorial Proposal (CF Tutorial Proposal)
  • 28 November 2008, Tutorial Notification
  • 08 December 2008, Author Notification
  • 09 January 2009, Camera Ready
  • 27-30 April 2009, Conference & Workshop

Workshop on Web Search Click Data 2009

. Thursday, August 07, 2008

Via the Nihil Obstat blog, I read about an interesting call for research proposals, the Workshop on Web Search Click Data 2009, held in conjunction with WSDM09. The workshop is organized by Microsoft and Yahoo! Labs and aims to work with an MSN search log containing about 15 million queries, which is reeeeeeally interesting :)

More information in the post written by José María Gómez or in the workshop webpage.

Conference Rankings


One important process in science is the publication of research results. Sometimes you have written an excellent paper (not me; I usually work with a deadline in mind so that I work under pressure) but you don't know a good journal or conference in which to publish it. JCR listings make some decisions about publishing in journals easier, but there isn't an equivalent indicator for conferences.

One of my thesis advisors, Ana Iglesias, gave me some good advice on choosing relevant conferences. "Translating" that advice into rules, a good conference is the one that
Do you use any other ranking based criteria in order to choose a good conference?

According to these rules, the conferences I've attended recently or I will attend soon, are:
  • CIKM: A in Core (OK), top 35% in Citeseer (NO), 0.90 in CS (OK) => OK
  • IDEAL: C in Core (NO), top 96.72% in Citeseer (NO) => NO
  • ECDM: => NO
And the conferences I have sent papers to:
  • ICDM: A+ in Core (OK), top 59.86% in Citeseer (OK), 0.73 in CS (NO) => OK
  • ICDIM: =>NO
Most of the results make sense. CIKM and ICDM are really interesting conferences. IDEAL is an interesting forum, as it combines information processing and data mining with more applied topics such as bioinformatics and financial engineering, but it is not a really cutting-edge conference. ECDM is not a good conference, and it is logical that it doesn't appear in those rankings. The only conference where I disagree with the previous "rules" is ICDIM. I think ICDIM is a good conference but, maybe because it is so young, it doesn't appear in the rankings.

According to these rankings, the most interesting conferences on Data Mining and Machine Learning are:
  • Citeseer: ICML, IJCAI, KDD, AAAI, NIPS
  • CORE: It's not a real ranking, but it gives A+ to all the best conferences from citeseer and CS Ranking.

Google Translation + Flickr API = FlickrBabel

. Tuesday, July 22, 2008

I'm currently involved in the development of a new startup, Wipley, a social network for videogamers. As I've been "playing" with a lot of web application APIs, I've had some ideas about integrating some of them to create something that could be useful.

The first application I've developed for Wipley is FlickrBabel, a simple application that improves photo search in Flickr by means of automated translation (Google translation API) and query expansion, so that the search (through the Flickr API) covers a more general query. This method can be very useful for many people, especially non-English speakers, as Flickr (like many other web applications) is used more by English speakers than by Spanish ones or, at least, there are more photos tagged and described in English than in Spanish.

As a simple practical example, if you search for "girasol" (the Spanish translation of sunflower) in Flickr, you may get over 6,200 results. If you search for "sunflower", you get more than 187,714 results. So, if you speak some English, you should use English instead of Spanish for your Flickr queries. There are other cases, however, where English queries do not work as well as in the previous example. For instance, if you search for "omelette", you'll get over 11,000 results, but the Spanish translation, "tortilla", will get almost 30,000 results. FlickrBabel helps us by automatically translating our queries and running them in both languages (I'll extend the functionality to other languages very soon).
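The idea of querying in both languages can be sketched roughly as follows. This is not the actual FlickrBabel code: `bilingual_search` is a hypothetical helper, and the translation and search steps are passed in as plain functions (stubbed here with invented data) so that no real API call or signature is assumed.

```python
from typing import Callable, List


def bilingual_search(query: str,
                     translate: Callable[[str], str],
                     search: Callable[[str], List[str]]) -> List[str]:
    """Run the original query and its translation, merging results without duplicates."""
    queries = [query]
    translated = translate(query)
    if translated.lower() != query.lower():
        queries.append(translated)

    merged, seen = [], set()
    for q in queries:
        for photo_id in search(q):
            if photo_id not in seen:
                seen.add(photo_id)
                merged.append(photo_id)
    return merged


# Stubs standing in for a translation service and a photo search service.
FAKE_DICTIONARY = {"girasol": "sunflower"}
FAKE_INDEX = {
    "girasol": ["es-1", "shared-7"],
    "sunflower": ["en-1", "shared-7", "en-2"],
}


def fake_translate(text: str) -> str:
    return FAKE_DICTIONARY.get(text, text)


def fake_search(text: str) -> List[str]:
    return FAKE_INDEX.get(text, [])


results = bilingual_search("girasol", fake_translate, fake_search)
print(results)
```

Passing the two services in as functions keeps the merging logic testable on its own; in a real application they would wrap whatever translation and photo search clients are available.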

Now I'm working on several ways to relate photos to one another by means of contextual analysis. The application is at a beta stage, and I'd appreciate any feedback, either as a reply to this post or as a reply to the post we wrote on the official Wipley blog :D

Post-Summer Conferences

. Saturday, July 19, 2008

After the summer, we (Franki and I) will be attending CIKM 2008 (in Napa Valley, California) and IDEAL 2008 (in Daejeon, Korea), presenting different parts of the work we are doing in the SINAMED and ISIS projects.

Our ongoing research in these projects "is mainly focused on using biomedical concepts for cross-lingual text classification. In this context the use of concepts instead of bag of words representation allows us to face text classification tasks abstracting from the language". For cross-lingual text tasks, "we evaluate the possibility of combining automatic translation techniques with the use of biomedical ontologies to produce an English text that can be processed by MMTx", saving the effort of developing a Spanish MetaMap.

I hope to see some of you at CIKM or IDEAL :)

KDD 2009 in Paris

. Wednesday, July 16, 2008

KDD comes to Europe, which is great news for European data miners :) From June 28 to July 1, KDD will be held in Paris. No key dates have been defined for the conference yet, but I suppose the Call for Papers will be out by the end of January 2009.

JMLR: Workshop and Conference Proceedings

. Tuesday, July 15, 2008

The Journal of Machine Learning Research (JMLR) is one of the leading journals in Machine Learning. Ranked 7th in the "Computer Science, Artificial Intelligence" category of the JCR, it has an impact factor of 2.682.

Beyond the quality of the journal and the papers published there, JMLR has been a great initiative as the first quality Open Access journal in the Machine Learning field. For the last two years, JMLR has been innovating with new initiatives such as supporting the development of Open Source Machine Learning software or the recent creation of a special "Conference and Workshop Proceedings" series that aims to publish the work presented at Machine Learning workshops and conferences in an Open Access manner. The series has an ISSN (1938-7228) and is described by JMLR as follows:

The JMLR: Workshop and Conference Proceedings series is a new series aimed specifically at publishing work presented at workshops and conferences. Each volume is separately titled and associated with a particular workshop or conference and will be published online on the JMLR web site. Authors will retain copyright and individual volume editors are free to make additional hardcopy publishing arrangements, but JMLR will not produce hardcopies of these volumes.

AUC as Performance Metric in ML

. Friday, July 04, 2008

ROC analysis is a classic methodology from signal detection theory used to depict the tradeoff between hit rates and false alarm rates of classifiers (Egan 1975, Swets 2000). ROC graphs have also been commonly used in medical diagnosis for visualizing and analyzing the behavior of diagnostic systems (Swets 1998). Spackman (Spackman 1989) was one of the first machine learning researchers to show interest in using ROC curves. Since then, the interest of the machine learning community in ROC analysis has increased, due in part to the realization that simple classification accuracy is often a poor metric for measuring performance (Provost 1997, Provost 1998).

The ROC curve compares the classifier's performance across the entire range of class distributions and error costs (Provost 1997, Provost 1998). A ROC curve is a two-dimensional representation of classifier performance, which is useful for showing some characteristics of a classifier but makes it difficult to compare one classifier against another. A common way to reduce ROC performance to a single scalar value, which is easier to handle, is to calculate the area under the ROC curve (AUC) (Fawcett 2005). As the ROC curve is drawn in the unit square, the AUC value will always be between 0.0 and 1.0, and the best classifiers are the ones with the highest AUC. As random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC below 0.5.
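As a small illustration of the area computation, the sketch below builds the ROC points of a scoring classifier by sweeping a threshold over its scores and then sums trapezoids under the curve. The scores and labels are invented, and the function names are mine rather than taken from any of the cited papers.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold swept from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given as sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Invented scores; label 1 is the positive class, 0 the negative class.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_trapezoid(roc_points(scores, labels))
random_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])  # the diagonal of random guessing
print(auc, random_auc)
```

In this toy example eight of the nine positive/negative pairs are ordered correctly by the scores, which matches the resulting area of 8/9.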

Fig. 1. Example of ROC graphs, figure extracted from (Fawcett 2005). Subfigure a shows the AUC of two different classifiers. Subfigure b compares the graph of a scoring classifier B, and a discrete simplification of the same classifier, A.

Figure 1a shows two ROC curves representing two classifiers, A and B. Classifier B obtains a higher AUC than classifier A and is therefore expected to behave better. Figure 1b shows a comparison between a scoring classifier (B) and a binary version of this classifier (A). Classifier A represents the performance of B when it is used with a fixed threshold. Though they represent almost the same classifier, A's performance measured by AUC is inferior to B's. As we have seen, a full ROC curve cannot be generated from a discrete classifier, which results in a less accurate performance analysis. For this reason, in this paper we focus on scoring classifiers, although there have been some attempts to create scoring classifiers from discrete ones (Domingos 2000, Fawcett 2001).

Hand and Till (Hand2001) present a simple approach to calculating the AUC of a given classifier.
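The formula itself is not shown above. As I understand the approach, the instances are ranked by increasing score and the AUC is estimated as (S0 - n0(n0+1)/2) / (n0 n1), where n0 and n1 are the numbers of positive and negative instances and S0 is the sum of the ranks of the positive ones. A minimal sketch follows, with invented data and with tied scores sharing their average rank (an implementation choice of mine):

```python
def auc_from_ranks(scores, labels):
    """Rank-sum estimate of the AUC; label 1 marks the positive class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)

    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of instances tied with the one at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n0 = sum(1 for y in labels if y == 1)
    n1 = len(labels) - n0
    s0 = sum(rank for rank, y in zip(ranks, labels) if y == 1)
    return (s0 - n0 * (n0 + 1) / 2.0) / (n0 * n1)


# Invented scores; label 1 is the positive class, 0 the negative class.
rank_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
rank_labels = [1, 1, 0, 1, 0, 0]

rank_auc = auc_from_ranks(rank_scores, rank_labels)
print(rank_auc)
```

Here the positive instances get ranks 3, 5 and 6, so S0 = 14 and the estimate is (14 - 6) / 9 = 8/9. No ROC curve has to be drawn to obtain it.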


Performance Metrics

. Monday, June 30, 2008

Performance metrics are values calculated from the predictions of the classifiers that allow us to validate the classifier's model. These performance metrics are usually defined in terms of a confusion matrix. Figure 1 shows a confusion matrix for a two-class problem, which serves as an example for describing the basic performance metrics. In the figure:
  • π0 denotes the a priori probability of class (+).
  • π1 denotes the a priori probability of class (-); π1 =1-π0
  • p0 denotes the proportion of times the classifier predicts class (+).
  • p1 denotes the proportion of times the classifier predicts class (-); p1=1-p0.
  • TP is the number of instances belonging to class (+) that the classifier has correctly classified as class (+).
  • TN is the number of instances belonging to class (-) that the classifier has correctly classified as class (-).
  • FP is the number of instances that, belonging to class (-), the classifier has classified as positive (+).
  • FN is the number of instances that, belonging to class (+), the classifier has classified as negative (-).

Fig. 1: Confusion matrix that generates the needed values for standard performance metrics

The precision is the percentage of true positive instances among all the instances classified as positive by the classifier; precision=TP/(TP+FP). The recall is the percentage of instances of class (+) that the classifier classifies as positive; recall=TP/(TP+FN). The accuracy is the percentage of correctly classified instances; accuracy=(TP+TN)/(TP+TN+FP+FN).

There are other approaches to estimating the classifier's performance that are used when dealing with a large set of classes. One of them is Fβ, which tries to compensate for the effect of a non-uniform distribution of the instances among the classes. Fβ is calculated as follows

$$F_\beta = \frac{(1+\beta^2)\cdot \mathrm{precision}\cdot \mathrm{recall}}{\beta^2\cdot \mathrm{precision} + \mathrm{recall}}$$

Van Rijsbergen (vanRijsbergen, 1979) states that Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision. One of the most typical uses of Fβ is F1, the harmonic mean of precision and recall.

Traditionally, evaluation metrics like recall, precision and Fβ have been largely used by the Information Retrieval community. Classification accuracy has been the standard performance estimator in Machine Learning for years. Recently, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, traditionally used in medical diagnosis, has been proposed as an alternative measure for evaluating the predictive ability of learning algorithms.
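The metrics discussed in this post can be computed directly from the four cells of the confusion matrix. The sketch below uses the usual textbook definitions, which is an assumption on my part: accuracy as the share of all instances classified correctly, recall as TP/(TP+FN), and Fβ as the weighted harmonic mean of precision and recall. The counts are invented.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly positive instances that were predicted positive."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Invented confusion matrix; zero denominators are not handled in this sketch.
tp, fp, fn, tn = 40, 10, 20, 30

print(precision(tp, fp))          # 40 / 50
print(recall(tp, fn))             # 40 / 60
print(accuracy(tp, tn, fp, fn))   # 70 / 100
print(f_beta(tp, fp, fn))         # F1 = 80 / 110
print(f_beta(tp, fp, fn, beta=2)) # F2 = 200 / 290
```

With beta=2 the score moves towards the recall value, which is the "β times as much importance to recall" reading of the measure.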


The Need for Open Source Software in Machine Learning

. Wednesday, June 25, 2008

Reading the Undirect Grad blog, I found an interesting paper about the need for more Open Software in Machine Learning. The abstract:
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
I think this paper addresses a very interesting problem, and not only for the ML community. As said in the paper, "Open Source model allows better reproducibility of the results, quicker detection of errors, innovative applications, faster adoption of ML methods in other disciplines", but it also avoids constantly reinventing the wheel, and it is a fairer model: if most research is funded with public money, why should researchers block access to the code?

The same happens with publications. Open Access should be a necessary condition for all publicly funded research. Luckily, there are several initiatives all around the globe trying to spread the benefits of the Open Access model, such as Harvard's adoption of Open Access or the support of the Comunidad de Madrid (a Spanish region) for several Open Access initiatives (sorry for the link in Spanish).

In recent years, the ML community has improved in these respects. We have a very good Open Source ML framework in Weka, a top Open Access journal in JMLR, which also supports ML Open Source software, and a very good Open Source software repository in MLOSS.

Automated Microarray Classification Challenge

. Tuesday, June 24, 2008

The diagnosis of cancer on the basis of gene expression profiles is well established, so much so that micro-array classification has become one of the classic applications of machine learning in computational biology. The field has now reached the stage where a large scale evaluation exercise is warranted to determine the advantages and disadvantages of competing approaches. We have therefore organized a challenge for ICMLA'08, the aim of which is to determine the best fully automated approach to micro-array classification. An unusual feature of the competition is that instead of submitting predictions on test cases, the competitors submit a MATLAB implementation of their algorithm (R and Java interfaces are also in development), which is then tested off-line by the challenge organizers. This will test the true operational value of the method, in the hands of an end user who is not necessarily an expert in a given technique. The winner of the challenge will receive a free registration to ICMLA'08.

Further details and background information regarding the competition are available from the challenge website. If you have any questions, please feel free to contact the challenge organizers.

The results of the challenge will be presented at a special session at ICMLA'08. Competitors are encouraged to participate in the special session and are invited to submit a technical paper describing their technique. Submissions should be made electronically in PDF format using the central ICMLA'08 website. The deadline for submissions is June 15, 2008. All accepted papers must be presented by one of the authors in order to be published in the conference proceedings.

Important Dates

Challenge opens March 10, 2008
Challenge closes July 15, 2008
Paper submission due July 15, 2008
Notification of acceptance September 1, 2008
Camera-ready papers & pre-registration October 1, 2008
ICMLA'08 conference December 11-13, 2008

Special Session Chair

Dr Wenjia Wang, University of East Anglia, Norwich, U.K.

Special Session Organizers

Dr Gavin Cawley, University of East Anglia, Norwich, U.K.
Dr Wenjia Wang, University of East Anglia, Norwich, U.K.
Mr Geoffrey Guile, University of East Anglia, Norwich, U.K.

KDD Cup 2008 and Workshop on Mining Medical Data


KDD Cup is the first and the oldest data mining competition, and is an integral part of the annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Based on data provided by Siemens Medical Solutions USA, this year's KDD Cup competition focuses on the early detection of breast cancer from X-ray images of the breast. We are looking forward to an interesting competition and your participation. We particularly encourage the participation of students.

There are 2 different parallel options for participating:
  1. Submit entries to the KDD Cup competition
  2. Paper submissions for the associated Workshop on Mining Medical Data

Further details on each option are provided below.

KDD Cup 2008

Siemens Medical Solutions is proud to provide the data for the KDD Cup 2008 competition. The competition focuses on the early detection of breast cancer from X-ray images of the breast. There are two specific tasks, selected to be interesting to participants from academia and industry. The tasks are described in detail on the competition web site. You can choose to compete in either or both of the tasks. The training data can be downloaded after April 3, 2008. Important dates are listed below.

April 1 Web site up. Registration opens
April 3 Training data and evaluation code available after login
June 2 Test data available for download after login
June 20 Registration for KDD Cup closes
July 7 Last date for submission of results on test set
July 15 Notification of KDD Cup competition results
July 31 Winners submit their camera ready papers to the workshop
August 24-27 Winners present their work at the workshop.

Workshop on Mining Medical Data

We invite the submission of papers related to mining medical data. Participants in the KDD Cup 2008 may optionally submit papers to this workshop describing their entry. However, the workshop is broader in scope, and we also welcome other submissions related to the mining of medical data from structured sources such as structured databases and from unstructured data sources such as medical images, textual notes, etc. We particularly invite papers describing systems that are able to combine all available patient information whether from structured sources or from unstructured sources, to support medical decision making.

All submitted papers will be evaluated by the workshop program committee based on scientific merits and novelty as perceived by the committee. Accepted papers will appear in the workshop proceedings. Authors of the accepted papers are required to present their papers at the workshop. Depending on interest, a subset of the selected papers may also be published in a special issue of a journal later on. Important dates are listed below.

All submitted papers must be in PDF format, must be restricted to 4 pages, and must use the workshop template.

July 7 Last date for submitting papers for the workshop
July 28 Author Notification about Accepted papers
July 31 Final Camera ready papers due
August 24-27 Authors of accepted papers present their work.

Usama Fayyad quits Yahoo


Before joining Yahoo!, Dr. Usama Fayyad worked for 5 years at Microsoft Research and built data mining solutions for Microsoft's server division. From 1989 to 1996, Usama held a leadership role at NASA's Jet Propulsion Laboratory (JPL). In 2000, he co-founded and served as CEO of digiMine Inc. (now Revenue Science Inc.), a data analysis and data mining company.

Dr. Fayyad has been at Yahoo! for more than 4 years as chief data officer and executive vice president of research and strategic data solutions. In that position, Fayyad has been responsible for Yahoo!'s overall data strategy, for architecting Yahoo!'s data policies and systems, and for managing Yahoo!'s data analytics and data processing infrastructure.

On June 12, New York Times Bits reported:
Mr. Fayyad told his staff yesterday that he would be leaving and his departure is expected to be officially announced later today. Mr. Fayyad was the data guru at Yahoo, the person in charge of mining the terabytes of data collected by the company to improve things like the targeting of ads and content to Yahoo users. He was also in charge of Yahoo’s well-respected research organization.
Gregory Piatetsky-Shapiro reported in KDnuggets some interesting words from Usama Fayyad, who says it is a good time to leave Yahoo! because his team will be able to continue his work. Usama seems to want to start a new company, taking advantage of his data mining knowledge and the broad view of the Internet, search, advertising and the future of interactive media that Yahoo! has given him.

With this announcement, Usama joins the many other Yahoo! execs who are currently trying to "run away" from Yahoo!.

Computational Linguistics (CL) goes Open Access

. Thursday, June 19, 2008

Hal announces that the CL journal will be Open Access starting with the first issue of next year. There will be no print version of the journal, and the electronic version will be Open Access.

The existence of an important Open Access journal for Computational Linguistics has been a topic of discussion in recent years. In May 2007, Hal published the post "Whence JCLR?", where he discussed JMLR, an Open Access Machine Learning journal that is one of the key journals for the ML community.

This is really very good news for the CL community.

The Discipline of Machine Learning

. Tuesday, June 17, 2008

Tom Mitchell is one of the key figures in the Machine Learning discipline. He has been working in this area since the late 1970s, has published reference ML textbooks and, above all, is the head of the first Machine Learning department in the world.

In 2006, when he was "fighting" for the creation of the ML department at Carnegie Mellon University, he was told that "you can only have a department if you have a discipline that is going to be here in one hundred years otherwise you can not have a department". To argue that ML would last more than a hundred years, he wrote a white paper, "The Discipline of Machine Learning", which is a real must-read for everyone interested in ML. The abstract of the paper states:

Over the past 50 years the study of Machine Learning has grown from the efforts of a handful of computer engineers exploring whether computers could learn to play games, and a field of Statistics that largely ignored computational considerations, to a broad discipline that has produced fundamental statistical-computational theories of learning processes, has designed learning algorithms that are routinely used in commercial systems for speech recognition, computer vision, and a variety of other tasks, and has spun off an industry in data mining to discover hidden regularities in the growing volumes of online data. This document provides a brief and personal view of the discipline that has emerged as Machine Learning, the fundamental questions it addresses, its relationship to other sciences and society, and where it might be headed.

Tom also gave a speech related to this matter at the Carnegie Mellon University School of Computer Science's Machine Learning Department in March 2007. You can watch Mitchell's speech in this video.

ECML PKDD Discovery Challenge 2008

. Monday, June 16, 2008

This year, the ECML/PKDD discovery challenge is about social bookmarking. There are two main tasks: Spam Detection in Social Bookmarking Systems and Tag Recommendation in Social Bookmark Systems. The challenge is organized in conjunction with the Web 2.0 Mining workshop and seems very interesting. The test data set will be released on July 30th, so there is still enough time to try something :)


. Wednesday, June 11, 2008

Some interesting interviews with important people from the DM and ML communities. Thanks to VideoLectures for hosting all this interesting material.

Dr. Usama Fayyad is responsible for Yahoo!'s overall data strategy, architecting Yahoo!'s data policies and systems, prioritizing data investments, and managing the Company's data analytics and data processing infrastructure.

Tom Mitchell is the first Chair of Department of the first Machine Learning Department in the World, based at Carnegie Mellon.

Gregory Piatetsky-Shapiro, Ph.D. is the President of KDnuggets, which provides research and consulting services in the areas of data mining, knowledge discovery, bioinformatics, and business analytics.

Journal of Interesting Negative Results in Natural Language Processing and Machine Learning

. Saturday, May 24, 2008

Johannes Fuernkranz sent this announcement to the ML-news list. I think it is great news for the NLP and ML communities, as some negative results can be even more useful than some positive ones. This is a good way to keep others from spending time exploring hypotheses that have already been invalidated.

Journal of Interesting Negative Results

We are happy to announce the on-line publication of the first article in the Journal of Interesting Negative Results in Natural Language Processing and Machine Learning. Please visit the journal's website and click on "articles".

JINR is an electronic journal, with a printed version to be negotiated with a major publisher once we have established a steady presence. The journal will bring to the fore research in Natural Language Processing and Machine Learning that uncovers interesting negative results.

It is becoming more and more obvious that the research community in general, and those who work in NLP and ML in particular, are biased towards publishing successful ideas and experiments. Insofar as both our research areas focus on theories "proven" via empirical methods, we are sure to encounter ideas that fail at the experimental stage for unexpected, and often interesting, reasons. Much can be learned by analysing why some ideas, while intuitive and plausible, do not work. The importance of counter-examples for disproving conjectures is already well known. Negative results may point to interesting and important open problems. Knowing directions that lead to dead-ends in research can help others avoid replicating paths that take them nowhere. This might accelerate progress or even break through walls!

We propose this journal as a resource that gives a voice to negative results which stem from intuitive and justifiable ideas, proven wrong through thorough and well-conducted experiments. We also encourage the submission of short papers/communications presenting counter-examples to usually accepted conjectures or to published papers.

The journal's scope encompasses all areas of Natural Language Processing and Machine Learning. Papers published in JINR will meet the highest quality standards, as measured by the originality and significance of the contribution. They will describe research with theoretical and practical significance. All theories and ideas will have to be clearly stated and justified by a deep literature review.