Grapevine: An Internet Distributed Computing Framework

Tapping into the unused resources of volunteer computers connected to the internet is not a new idea. Indeed, there are currently many active efforts in this area (e.g. the SETI project, distributed.net, Parabon). While many early distributed computing efforts targeted specific applications, more recent work has focused on the development of general distributed computing infrastructure. Leaders in this field include Parabon Computation, United Devices, and the Globus Project. Each of these efforts has developed a software infrastructure that supports distributed computing.

For our class project, we propose to explore some fundamental and practical issues in distributed computing on volunteered nodes connected to the internet by designing and implementing a simple distributed computing infrastructure. To gain an appreciation for the types of computing applications that are suited for a distributed computing environment, we also intend to develop one application to be deployed on the infrastructure.

Thus the project will consist of essentially two parts.

Part 1 : Design of the Infrastructure

To facilitate the design process, we plan to study the designs of the Parabon Computation, United Devices, and Globus Project infrastructures.

A few key features of this project that make it interesting are:

Preliminary Infrastructure Design Ideas

There will be two primary types of nodes in the infrastructure:

A simple model for the "program fragment" might be that they are serialized objects that implement some Java interface that we define. As a possibility, we could require that users implement a class that extends the Thread class and implements the Serializable interface plus a few of our own methods. Then when the job is submitted, the dispatcher could serialize the user class, send it over the network to the worker daemon which would unserialize the object and run the thread.

Some immediate issues to work out include:

Part II: Distribted Application

Machine Learning Classifiers

Machine learning techniques have a wide range of practical applications in fields like data mining, information retrival, image processing ,bioinformatics, stock prediction etc. A machine learning application typically involves finding inherent patterns in massive data that may not be interpretable by human. We are referring to the set of supervised learning algorithms here, which involves a training and testing phase. In the training phase, we are concerned with the selection of tranining examples and the model used to represent the data.

Some of the widely used models are listed below:

Bagging and Boosting: Bagging and Boosting are two common example-based techniques used to build classifier from smaller subsets of the dataset. In bagging, training examples are randomly selected with replacement to form smaller subsets. These subsets are used to train separate classifiers. A voting mechanism is then used to gather results from the mutiplier classifers. Boosting involves iteratively selecting the "difficult" examples as training examples. These techniques have demonstrated great success in improving the accuracy of the classifiers at the expense of high computation cost as multiple classifiers need to be built. As such, it fits nicely into our paradigm of distributed parallel computing where the multiple classifiers can be trained on different machines.

Task Definition: Mechanism for Bagging and Boosting

We would like to build a general framework to support Bagging and Boosting in a parallel distributed computing environment. Bagging is inherently parallelizable and boosting can be adapted to work in a parallel way. Our mechanism should support any models that adhere to our specified format of input/output for training file.

Problems

Demonstration of Result - Text Categorization

Imagine building a text categorization system (as in Yahoo, but involves not only websites but all documents residing on the machines in the environment). On each machine there is an individual classifier that will train on the documents residing on that machine. In boosting mode, it can send the documents that it classified incorrectly to other machine for training too. The classifier on each machine needs to be retrained after the documents increase by a certain number.

A comittee based classifier will be built using a voting mechanism based on these classifiers.Different machine learning algorithms involve different costs for training and testing.We will parallelize either the training phase, the testing phase or both phases, subject to time constraints, feasibility and usefulness.

Basic assumptions