CSE503 SoftwareIdol


Project Ideas

We want to compare OSS projects; there are a few different aspects of them we could compare, and a few different metrics we could (try to) use for actually doing the comparison.

  • Aspects we could compare, from hard to easy:
    • Cost of building/maintaining
    • Success
    • Popularity
    • Ratings
  • Metrics we could use:
    • Machine-learning of differentiating features (training a classifier)
    • "Englishyness"
    • Size (lines of code, number of revisions, etc)
    • Churn (rate of changes over time given repository change tracking)
    • Number of contributors
    • Other (many)

We probably want to combine a hard aspect with an easy metric or vice versa.
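As a concrete example of one of the easier metrics above, here is a minimal sketch of computing churn from repository change data. The (date, lines added, lines removed) tuple shape is hypothetical; real input would come from parsing a repository's change log.

```python
from datetime import date

def churn_rate(commits):
    """Average lines changed per day over a project's commit history.

    `commits` is a list of (date, lines_added, lines_removed) tuples --
    a hypothetical shape standing in for parsed repository log data.
    """
    if not commits:
        return 0.0
    days = (max(c[0] for c in commits) - min(c[0] for c in commits)).days or 1
    changed = sum(added + removed for _, added, removed in commits)
    return changed / days

commits = [
    (date(2008, 1, 1), 120, 10),
    (date(2008, 1, 15), 40, 25),
    (date(2008, 1, 31), 80, 5),
]
print(churn_rate(commits))  # 280 lines changed over 30 days
```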


  • Come up with a reasonable way of estimating the "cost" of an OSS project; then run some metrics against it.
  • Come up with a reasonable way of defining "success"; analyze how a few not-too-deep metrics, such as size or churn, correspond to that successful/unsuccessful split.
  • Take ratings or popularity, and measure some harder approaches (like Englishyness and churn) to try to come up with a correlation.
  • Take the easiest-to-measure aspects, get a good data set (ideally within a domain), and train a classifier using Weka or a similar toolkit; look at most-relevant features for something human-readable. http://en.wikipedia.org/wiki/Weka_(machine_learning)
    • A special case: Split a few code bases temporally and try to use classification of the repository for the *first year* of a project to relate to the success/cost/etc *eventually* (so as to come up with a predictor)
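To make the classifier idea concrete without pulling in Weka, here is a toy stand-in: rank features by how well a single-threshold decision stump on each one separates "successful" from "unsuccessful" projects. The feature values and labels below are invented purely for illustration.

```python
def stump_accuracy(values, labels):
    """Best accuracy achievable by a single threshold split on one feature."""
    best = 0.0
    for t in values:
        pred = [1 if v >= t else 0 for v in values]
        acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
        best = max(best, acc, 1 - acc)  # also allow the inverted split
    return best

# Invented data: four projects, label 1 = "successful"
features = {
    "loc":          [50_000, 1_200, 300_000, 4_000],
    "contributors": [12, 1, 80, 2],
    "churn":        [9.3, 12.0, 1.0, 0.4],
}
labels = [1, 0, 1, 0]

ranking = sorted(features, key=lambda f: stump_accuracy(features[f], labels),
                 reverse=True)
print(ranking)  # most-relevant (human-readable) feature first
```

A real toolkit like Weka would replace the stump with a full decision tree or similar model, but the output we care about is the same: a human-readable ranking of which features matter.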

Also, this isn't a strict split, so we could compare aspects against one another: how does cost relate to ratings and popularity? How does popularity relate to success?

Current Work

We're currently focusing on popularity, as measured by ratings and downloads. As of 2/5, we're making a list of media-related software and running easy metrics against it.

To-do list:

  • Cynthia is writing Dr. N. to try to schedule something for Friday, either after 4:30 (sigh) or between 10:30 and 12.
  • Alex is generating a text file with (up to) 100 pieces of media-related, open-source, multiple-contributor software to start analyzing. This will include number of contributors and size/lines of code, per Ohloh.
  • Cynthia is collecting ratings/etc metrics for that software from various sites, and populating the table with that.
  • Suporn is figuring out what the right approach to counting search hits is, and populating the table with that.
  • Meet with Dr. N, 10:30am (ugh) Friday 2/8, 5th floor breakout.

OSS Software

Ohloh seems to be a good source. Two example listings for media software:

Lists of "best open-source software" lists:

The first software mentioned in all of the best-open-source lists (some are listed below) is Firefox. So I started looking at Firefox and realized that the add-ons are open-source. Though one could argue that anything we find there might be a Firefox-only characteristic, it's a good place to start looking with a smaller scope.

The Firefox add-on browser can rank add-ons by popularity and by ratings. This is a great example of popularity and ratings failing to correlate: the top-rated download-management add-ons are far from the most popular ones. The simple reasons behind this are:

  1. only a few people rate any one of the add-ons.
  2. the ratings are not normalized by the number of people who rate them.
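One standard fix for both problems is a Bayesian (shrunken) average: pull each add-on's raw mean rating toward the site-wide mean, with the pull strongest when few people have rated it. A minimal sketch; the prior weight of 10 votes and the example numbers are arbitrary assumptions, not Firefox data.

```python
def bayesian_average(mean_rating, n_votes, global_mean, prior_weight=10):
    """Shrink a raw mean rating toward the global mean.

    Acts as if `prior_weight` phantom votes at the global mean were cast,
    so sparsely rated items can't dominate the ranking.
    """
    return (prior_weight * global_mean + n_votes * mean_rating) / (prior_weight + n_votes)

global_mean = 3.8  # assumed site-wide average rating

niche = bayesian_average(5.0, 3, global_mean)       # perfect score, 3 raters
popular = bayesian_average(4.6, 2000, global_mean)  # good score, 2000 raters
print(niche, popular)  # the heavily rated add-on now ranks higher
```

With this correction, a 5.0 average from 3 raters no longer outranks a 4.6 average from 2000 raters.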

Ratings Sources

More on Metrics

Search hits: Experiments with using the same list to do web hit counts are interesting. Audacity, Gimp, Pidgin, Eraser, etc. are words with multiple meanings. eMule, eMule Morph, and eMule Xtreme Mod are all separate downloads, but their hit counts overlap. Some names are too short a string to search for reliably (such as "ABC [Yet Another Bittorrent Client]").
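One possible mitigation, sketched below: quote the exact project name, add a disambiguating context phrase, and exclude the names of overlapping variants. The context phrase and the minus-term syntax are assumptions about the search engine's query language, and would need tuning per engine.

```python
def hit_query(name, exclude=(), context='"open source"'):
    """Build a search query for counting hits for one project.

    Quoting the name cuts partial matches; `context` cuts homonym noise
    (Audacity, Eraser, ...); `exclude` subtracts overlapping variants.
    """
    parts = [f'"{name}"', context]
    parts += [f'-{term}' for term in exclude]
    return " ".join(parts)

print(hit_query("eMule", exclude=("Morph", "Xtreme")))
# "eMule" "open source" -Morph -Xtreme
```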

General OSS Resources

Cost of an OSS project

A great advantage of open-source software over commercial software is price: users get the software for free. But real work went into these products. What are the actual costs of making this software?

Ohloh has an "estimated number of person-years" and associated estimated cost.
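Ohloh's person-years figure is reportedly produced with the basic COCOMO model. A sketch of that calculation, assuming the organic-mode constants and a made-up fully loaded monthly salary:

```python
def cocomo_organic(sloc, monthly_cost=6_250):
    """Basic COCOMO, organic mode: effort (person-months) = 2.4 * KLOC**1.05.

    monthly_cost is an assumed fully loaded salary (~$75k/year).
    """
    effort_pm = 2.4 * (sloc / 1000) ** 1.05
    return effort_pm, effort_pm * monthly_cost

effort, cost = cocomo_organic(100_000)  # a hypothetical 100 KLOC project
print(f"{effort / 12:.1f} person-years, ${cost:,.0f}")
```

For 100 KLOC this gives roughly 25 person-years, which is the kind of headline number Ohloh displays.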


"Indeed, as we have repeatedly emphasized, the Internet is the primary enabler of the OSS development and distribution process, making it possible for widely distributed groups to share ideas and software extremely quickly at negligible cost." Understanding Open Source Software Development By Joseph Feller, Brian Fitzgerald

"But open source is a low-cost way of increasing the opportunity for surprise." Lessons from Open Source software development, Tim O'Reilly 1999

Success of Open Source Projects

Most downloaded on sourceforge: http://sourceforge.net/top/topalltime.php?type=downloads


Defining Open Source Software Project Success, Kevin Crowston, Hala Annabi, and James Howison, 2003
http://floss.syr.edu/publications/icis2003success.pdf This paper identifies a range of measures that can be used to assess the success of open source software (OSS) projects.

Information Systems Success in Free and Open Source Software Development: Theory and Measures

The perils and pitfalls of mining SourceForge

Useful Links

Motivations of open-source developers:

Working for Free? Motivations for Participating in Open-Source Projects

Why Open Source software can succeed:

Case Studies: A Case Study of Open Source Software Development: The Apache Server

How to Evaluate Open Source Software / Free Software (OSS/FS) Programs:

Related Work

FLOSS (Free/Libre and Open Source Software) includes both "free software" and "open source", which differ but share important characteristics. Researchers have investigated FLOSS from several different angles.

Howison and Crowston report that the easily accessible SourceForge data is hard to draw meaningful results from, because the data is dirty and needs careful screening and deep analysis.

Crowston et al. investigate what success means in the FLOSS context and examine the question empirically using data from SourceForge.

Mockus et al. collected many claims and pieces of conventional wisdom about open source software and tested them by analyzing email archives, source code change history, and problem reports.

David Wheeler wrote a guide for users on how to evaluate and pick open source software.

Conceptual integrity is originally from Brooks, The Mythical Man-Month.

"Eyeballs" is actually Linus' Law, from Eric Raymond, The Cathedral and the Bazaar.