Project Proposal

The main objective of this proposal is to improve the BuddyPress user experience by suggesting interesting things related to each user’s tastes. Basically, BP (or MU) will keep track of what the user reads most, where he comments most, what he usually writes about, which tags he uses most, and so forth, in order to learn more about the user.

Then in the near future, after days or weeks of using a BP installation, the system should be able to suggest things to him such as:

1. Which users share the same tastes.
2. Where he should comment.
3. What he might find interesting to read.
4. Tags based on the context, while he’s writing.
5. If the user has no idea what to write about, he could see what other people are writing about (text clustering, as Google News does).
6. Other ideas related to this subject are very welcome.
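
As an illustration of the first suggestion (users who share the same tastes), here is a minimal sketch in Python. It assumes each user's activity has already been reduced to a count of how often he used each tag, and it compares users with cosine similarity, which is one common choice rather than the method this proposal commits to. The names `cosine_similarity`, `similar_users`, and the sample profiles are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no activity is similar to nobody
    return dot / (norm_a * norm_b)


def similar_users(target, profiles, top_n=3):
    """Rank the other users by how closely their tag profile matches the target's."""
    scores = [
        (other, cosine_similarity(profiles[target], profile))
        for other, profile in profiles.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical activity data: how often each user used each tag.
profiles = {
    "alice": Counter({"php": 5, "wordpress": 3, "music": 1}),
    "bob": Counter({"php": 4, "wordpress": 2}),
    "carol": Counter({"cooking": 6, "music": 2}),
}

print(similar_users("alice", profiles))
```

The same profile vectors could also hold counts of posts read or blogs commented on, so the other suggestions in the list could reuse the comparison.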

Two years ago I read the paper http://citeseer.ist.psu.edu/68861.html and have been really interested in text classification ever since. I then read more papers (and some books) and got hooked on this subject, as well as on unsupervised learning and learning from user input. I’ve created some small projects as proofs of concept:

* http://www.phpclasses.org/browse/package/4236.html
* http://www.languess.com/
* And nowadays I am writing a PHP wrapper for the well-known libtextcat (http://github.com/crodas/phplibtextcat/tree/master).

Schedule of Deliverables

What are the milestones and deliverables for your project?

The project would be divided into two big stages, before and after the midterm, covering the following items:

1. Designing patterns and creating base classes with common functions. 1 week.
2. Coding the social algorithm to suggest friends, suggest potentially interesting blog posts, and handle other requests from the community, Automattic, or the mentor. 2 weeks.
3. Coding the text processing algorithm, which would include:
   1. Unsupervised learning (a.k.a. text clustering). 2 weeks.
   2. Supervised learning, which means generating features (n-grams, words) from text, then associating them with metadata (text, categories, tags) for future reuse. 2 weeks.
4. Optimization. The project should run on anything from shared hosting to large clusters of servers. This means it should be designed to run in several passes, saving its current state between them (because of Apache timeouts), and be able to run on several machines at the same time (with some data partitioning and parallel processing). After the midterm.
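
The supervised learning step in item 3 (generate features from text, then associate them with metadata) could be sketched as follows. This is only an illustration in Python under simple assumptions: features are lowercase words plus character trigrams, each tag accumulates the features of the texts it was attached to, and tags are ranked by raw feature overlap with the new text. The class `TagModel` and its methods are hypothetical names, and a real implementation would likely need better scoring and normalization.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of the lowercased text, with whitespace collapsed."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def features(text):
    """Feature counts: words and character trigrams, kept in separate namespaces."""
    counts = Counter(("word", word) for word in text.lower().split())
    counts.update(("gram", gram) for gram in char_ngrams(text))
    return counts


class TagModel:
    """Associates text features with the tags the author chose (hypothetical)."""

    def __init__(self):
        self.tag_features = defaultdict(Counter)

    def train(self, text, tags):
        """Store the features of an already tagged text for future reuse."""
        text_features = features(text)
        for tag in tags:
            self.tag_features[tag].update(text_features)

    def suggest(self, text, top_n=3):
        """Rank known tags by how many features they share with the text."""
        text_features = features(text)
        scores = []
        for tag, stored in self.tag_features.items():
            overlap = sum(min(count, stored[f]) for f, count in text_features.items())
            if overlap > 0:
                scores.append((tag, overlap))
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return [tag for tag, _ in scores[:top_n]]


# Hypothetical training data: posts with the tags their authors picked.
model = TagModel()
model.train("Installing a WordPress plugin with PHP hooks", ["wordpress", "php"])
model.train("Baking sourdough bread at home", ["cooking"])

print(model.suggest("Writing my first WordPress plugin", top_n=2))
```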

Items 2 and 3 would be done in parallel. The goal for the midterm is to get everything working, and then to optimize and create a scalable API.
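
The optimization item (running in several passes, saving state because of timeouts, and splitting work across machines) could look roughly like the sketch below. It is written in Python for illustration only. The checkpoint file format, the function names, and the use of a CRC32 hash to assign items to workers are all assumptions, not part of the proposal.

```python
import json
import os
import tempfile
import time
import zlib


def partition_of(item_id, num_workers):
    """Stable assignment of an item to a worker, so machines handle disjoint slices."""
    return zlib.crc32(str(item_id).encode("utf-8")) % num_workers


def load_state(path):
    """Load the checkpoint, or start from the beginning if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_index": 0, "processed": 0}


def save_state(path, state):
    """Write to a temporary file and rename it, so an interrupted write
    cannot leave a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run_batch(items, state_path, worker, num_workers, process,
              max_items=None, time_budget=20.0):
    """Process this worker's share of items until a limit is hit, then checkpoint.

    Returns True once every item has been scanned."""
    state = load_state(state_path)
    deadline = time.monotonic() + time_budget
    index = state["next_index"]
    scanned = 0
    while index < len(items) and time.monotonic() < deadline:
        if max_items is not None and scanned >= max_items:
            break
        if partition_of(items[index], num_workers) == worker:
            process(items[index])
            state["processed"] += 1
        index += 1
        scanned += 1
        state["next_index"] = index
    save_state(state_path, state)
    return index >= len(items)


# Hypothetical usage: one worker, three items per request, repeated until done.
with tempfile.TemporaryDirectory() as directory:
    checkpoint = os.path.join(directory, "state.json")
    post_ids = list(range(10))
    handled = []
    while not run_batch(post_ids, checkpoint, 0, 1, handled.append, max_items=3):
        pass
    print(len(handled))
```

Each call stands in for one web request or cron run: it picks up where the previous one stopped, so a timeout costs at most the work done since the last checkpoint.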