
MSR Mining Challenge 2010: 7th IEEE Working Conference on Mining Software Repositories
http://2010.msrconf.org

 

May 2nd-3rd, 2010
Cape Town, South Africa

Co-located with ICSE 2010,
International Conference on Software Engineering

 

Challenge Chair

Abram Hindle
Univ. of Waterloo, Waterloo, Canada

General Chair

Audris Mockus
Avaya, USA

Program Co-chairs

Jim Whitehead
University of California, Santa Cruz

Thomas Zimmermann
Microsoft Research, USA

Jury / Program Committee

Israel Herraiz
(Complutense University of Madrid, Spain)
Emily Hill
(University of Delaware, USA)
Annie Ying
(McGill University, Canada)
Emad Shihab
(Queen's University, Canada)
Zhen Ming Jiang
(Queen's University, Canada)
Rahul Premraj
(Vrije Universiteit Amsterdam, Netherlands)
Irwin Kwan
(University of Victoria, Canada)
Lile Hattori
(University of Lugano, Switzerland)
Adrian Schröter
(University of Victoria, Canada)

Location

Co-located with ICSE 2010,
Cape Town, South Africa

Earlier MSRs

MSR 2009 – Vancouver
MSR 2008 – Leipzig
MSR 2007 – Minneapolis
MSR 2006 – Shanghai
MSR 2005 – Saint Louis
MSR 2004 – Edinburgh

Previous Challenges

MSR Mining Challenge 2009
MSR Mining Challenge 2008
MSR Mining Challenge 2007
MSR Mining Challenge 2006

Latest news

The MSR 2010 Prediction Challenge has been extended by 2 days! Submit your predictions by February 22 and you could win a Zune HD!

The Mining Challenge deadline has been extended!

A CREX (CTags-based) extraction of the FreeBSD project has been added!

We've put up a parsed version of the FreeBSD bug database!

Overview

Since 2006, the IEEE Working Conference on Mining Software Repositories (MSR) has hosted a mining challenge. The MSR Mining Challenge brings together researchers and practitioners who are interested in applying, comparing, and challenging their mining tools and approaches on the software repositories of open source projects. Unlike previous years, which examined a single project, multiple projects in isolation, or a single distribution of projects (GNOME), this year the challenge involves examining the FreeBSD operating system and distribution, the GNOME Desktop Suite of projects, and the Debian/Ubuntu Distribution Database. The emphasis this year is on how the projects are inter-related, how they interact, and possibly how they evolve and function within a larger software ecosystem. There will be two challenge tracks: #1 (general) and #2 (prediction). The winner of each track will receive the MSR 2010 Challenge Award.

Challenge #1: General

In this category you can demonstrate the usefulness of your mining tools. The main task will be to find interesting insights by analyzing the software repositories of the projects within FreeBSD, the GNOME Desktop Suite, and the package-related metadata of the Debian/Ubuntu Distribution Database.

FreeBSD is a BSD-licensed Unix distribution. It includes packages for desktop, server, and embedded uses. FreeBSD also takes responsibility for porting many programs to its distribution via FreeBSD Ports.

The GNOME Desktop Suite is a mature collection of individual projects (nautilus, epiphany, evolution, etc.) and provides ample input for mining tools.

The Ultimate Debian Database (UDD) is a database of packages, package dependencies and related bugs. It describes the Debian and Ubuntu distributions.
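
To give a rough sense of how the UDD could be queried once its dump is loaded into a local PostgreSQL instance, the Python sketch below counts open bugs per source package. The connection parameters and the bugs table with source and status columns are assumptions about the schema, not something guaranteed by the challenge data.

    # Minimal sketch: count open bugs per source package in a local copy of the UDD.
    # Assumes the dump has been loaded into a PostgreSQL database named "udd" and
    # that a "bugs" table with "source" and "status" columns exists (hypothetical schema).
    import psycopg2

    conn = psycopg2.connect(dbname="udd", user="udd", host="localhost")
    cur = conn.cursor()
    cur.execute("""
        SELECT source, COUNT(*) AS open_bugs
        FROM bugs
        WHERE status != 'done'
        GROUP BY source
        ORDER BY open_bugs DESC
        LIMIT 20
    """)
    for source, open_bugs in cur.fetchall():
        print(f"{source}: {open_bugs} open bugs")
    cur.close()
    conn.close()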

One could examine multiple projects within these ecosystems, for instance by examining API usage across all projects, training a predictive model on one project and assessing its accuracy on another, or examining how developers' activity spans multiple projects.
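
As an illustration of the cross-project prediction idea, the sketch below trains a simple defect predictor on per-file metrics from one GNOME project and evaluates it on another. The CSV file names and the metric and label columns (changes, authors, loc, defective) are hypothetical placeholders for whatever your own mining tool extracts.

    # Sketch of cross-project defect prediction: train on one GNOME project,
    # evaluate on another. File names and columns are hypothetical placeholders.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score

    train = pd.read_csv("nautilus_file_metrics.csv")   # hypothetical training data
    test = pd.read_csv("epiphany_file_metrics.csv")    # hypothetical test data

    features = ["changes", "authors", "loc"]            # assumed per-file metrics
    model = LogisticRegression()
    model.fit(train[features], train["defective"])

    predicted = model.predict(test[features])
    print("precision:", precision_score(test["defective"], predicted))
    print("recall:   ", recall_score(test["defective"], predicted))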

Participation is straightforward:

  1. Select your mining area (one of bug analysis, change analysis, architecture and design, process analysis, team structure, etc.).
  2. Get project data for multiple GNOME projects, FreeBSD, or the UDD.
  3. Formulate your mining questions.
  4. Use your mining tool(s) to answer them.
  5. Write up and submit your 4-page challenge report.
    • Within the report you should clearly summarize what your contribution is, including what you found and its importance.

The challenge report should describe the results of your work and cover the following aspects: questions addressed, input data, approach and tools used, derived results and their interpretation, and conclusions. Keep in mind that the report will be evaluated by a jury. Make sure the contributions, purpose, scope, results, and importance or relevance of your work are highlighted within your report. Reports must be at most 4 pages long and must be in the IEEE CS proceedings style (two-column format).

The submission will be via Easychair (http://www.easychair.org/conferences/?conf=msrchallenge2010). Each report will undergo a thorough review, and accepted challenge reports will be published as part of the MSR 2010 proceedings. Authors of selected papers will be invited to give a presentation at the MSR conference in the MSR Challenge track.

Feel free to use any data source for the Mining Challenge. For your convenience, we provide repository logs, mirrored repositories, bugzilla database dumps, and various other forms of data linked at the bottom.
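
As a minimal sketch of working with the provided repository logs, the snippet below counts commits per author in a git-style log. The file name and the "Author:" line format are assumptions; adapt the parsing to whichever log format you actually download.

    # Minimal sketch: count commits per author in a git-style repository log.
    # The file name "nautilus.log" and the "Author: Name <email>" line format
    # are assumptions; adjust for the actual logs provided by the challenge.
    from collections import Counter

    authors = Counter()
    with open("nautilus.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if line.startswith("Author: "):
                authors[line[len("Author: "):].strip()] += 1

    for author, commits in authors.most_common(10):
        print(f"{commits:5d}  {author}")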

Challenge #2: Predict

This year, the MSR Mining Challenge prediction track involves predicting the final bug number within Debian on April 30th, 2010; that is, we want you to predict the newest bug number to appear on April 30th.

Participation is as follows:

  • Pick a team name, e.g., WICKED WARTHOGS, BAD BIRDS, etc.
  • Come up with predictions for the final Debian bug report number as of April 30th, based on some criteria or prediction model. A very simple model, for instance, would extrapolate from the amount of growth in the past three months (a minimal sketch follows below).
  • Write a paragraph (max 200 words) that describes how you computed your predictions.
  • Submit everything before February 22nd (Apia time; extended from February 20th) by email to msr2010predictions@challenge.softwareprocess.es.
Prediction submissions will be scored by their distance from the last bug number that occurs on April 30th 2010.
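
For teams looking for a starting point, here is a minimal sketch of the simple growth-based extrapolation mentioned above, together with the scoring rule (distance from the actual final bug number). The observation dates and bug numbers are made-up illustrative values, not real Debian data.

    # Sketch of the simple extrapolation model suggested above, plus the scoring
    # rule. The observation dates and bug numbers below are illustrative only.
    from datetime import date

    # Hypothetical observations: newest Debian bug number seen on each date.
    observations = [(date(2009, 11, 20), 558000), (date(2010, 2, 20), 570500)]

    (d0, n0), (d1, n1) = observations
    bugs_per_day = (n1 - n0) / (d1 - d0).days          # growth over ~3 months

    target = date(2010, 4, 30)
    prediction = round(n1 + bugs_per_day * (target - d1).days)
    print("predicted final bug number:", prediction)

    # Scoring: distance from the last bug number filed on April 30th.
    actual = 580000                                    # placeholder for the real value
    print("score (lower is better):", abs(prediction - actual))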

Frequently Asked Questions

  • Do I need to give a presentation at the MSR conference? For challenge #1, the jury will select finalists that are expected to give a short presentation at the conference. Then the audience will select a winner. For challenge #2, there is no presentation at the conference. The winners will be determined with statistical methods and announced at the conference.
  • Does the challenge report have to be four pages? No, of course you can submit fewer than four pages. The page limit was set to ease the presentation of space-intensive results such as visualizations.
  • Wow, the data set is soooo big! My tool won't finish in time. What can I do? Just run your tool on a subset of the projects. For instance, you could examine only the nautilus file manager and the epiphany web browser. Especially when you are doing visualizations, it is almost impossible to show everything.
  • My cat is a visionary...can I submit its predictions or is the challenge #2 only for tools? Of course, go ahead and submit its predictions as a benchmark. However, your cat will compete out of competition: only predictions generated by tools or by humans in a systematic way are eligible to win challenge #2.
  • For the prediction challenge, can random guesses also win? If the randomness is systematic it is allowed, and if the randomness is human generated it is allowed; in general, it must be systematic randomness.
  • For challenge #2 (prediction), is it acceptable if our team submits more than one prediction? Only one submission per team (or person) is allowed.
  • Do I have to attend in order to win the prize for either challenge? Yes, you do, or someone who can pick it up for you must attend; we want to avoid the complication of shipping prizes around the globe.

Important Dates

Submission of reports: February 7th, 2010 (extended from February 6th, 2010)
Submission of predictions: February 22nd, 2010 (extended from February 20th, 2010)
Author notification: February 20, 2010
Camera-ready copy: March 12, 2010
Conference dates: May 2nd-3rd, 2010
Note: All deadlines are 11:59 PM (Apia, Samoa Time) on the dates indicated.

Data

Thank you to

  • The efforts of Christian Bird, which made this challenge so much easier to run.
  • The efforts of Israel Herraiz in parsing the email databases.
  • The efforts of Emad Shihab in parsing the version control systems.