Sorry for the rather late report, but I’ve been busy coding my GSoC project. Over the past week I’ve been testing an algorithm to suggest friends (based on the friends that users follow in common). The first hard task was getting a dataset to test some ideas I had in my tiny little mind, so I wrote a small Twitter crawler and downloaded some users (~5,000).

Once I had downloaded the dataset (it took a while, since Twitter limits API calls to 100/hour), I ran the KMeans algorithm to find groups of similar users (based on whom they follow, aka “Friends”), and after 7 minutes of running time it discovered around 50 groups of users. The query process is pretty simple: compare the user’s friends against the groups of users, choose the most similar group, and return the diff of friends as suggestions. It gave great results in my tests.
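To make the idea concrete, here is a minimal Python sketch of the clustering and query steps described above. It is not the actual phpcluster code: the function names, the evenly spaced seeding, the `min_share` threshold and the tiny sample data are all assumptions made for illustration. Each user is treated as the set of accounts they follow, and each group is summarized by the share of its members following each account.

```python
from collections import Counter

def distance(friends, centroid):
    """Squared Euclidean distance between a user (set of followed accounts,
    seen as a 0/1 vector) and a centroid (dict: account -> share of members)."""
    dot = sum(centroid.get(f, 0.0) for f in friends)
    return len(friends) - 2 * dot + sum(w * w for w in centroid.values())

def centroid_of(members):
    """Share of group members following each account."""
    counts = Counter(f for friends in members for f in friends)
    return {f: c / len(members) for f, c in counts.items()}

def kmeans(users, k, iterations=10):
    """users: dict name -> set of followed accounts. Returns k centroids.
    Seeding with evenly spaced users is an assumption, chosen to be deterministic."""
    names = sorted(users)
    centroids = [centroid_of([users[names[i * len(names) // k]]]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for name in names:
            best = min(range(k), key=lambda i: distance(users[name], centroids[i]))
            groups[best].append(users[name])
        # Keep the old centroid if a group ends up empty.
        centroids = [centroid_of(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def suggest(friends, centroids, min_share=0.5):
    """Pick the most similar group and return what it follows that the user does not."""
    best = min(centroids, key=lambda c: distance(friends, c))
    candidates = [(w, f) for f, w in best.items() if f not in friends and w >= min_share]
    return [f for w, f in sorted(candidates, key=lambda x: (-x[0], x[1]))]

# Made-up sample data, only to show the flow.
users = {
    "ana": {"php", "wordpress", "buddypress"},
    "bob": {"php", "wordpress", "gsoc"},
    "cat": {"nasa", "spacex", "esa"},
    "dan": {"nasa", "spacex", "hubble"},
}
groups = kmeans(users, k=2)
print(suggest({"php", "gsoc"}, groups))  # ['wordpress', 'buddypress']
```

Note that the last call queries with a user who was not part of the clustering; only the stored group summaries are needed at query time.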

The main feature of this algorithm is that it can also suggest friends to new users (who did not exist at computation time), as long as they have some friends. Another key feature is that the computation does not need to run very frequently; it depends on how many new friends have been added to the system. Probably once a week is fine for large BP installations.

As you may notice, the computation takes a while (7 minutes for 5,000 users with 300,000 friendship nodes is not that bad). In the near future it could be computed in parallel with Hadoop for the largest installations, since the KMeans problem can run in parallel, but I’ll cover that later, since “premature optimization is the root of all evil”.
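Just to illustrate why KMeans parallelizes well, here is a rough Python sketch of one iteration split into a map step and a reduce step. It is not Hadoop code, and the function names and sample data are assumptions; the point is only that each chunk of users can be processed independently, and the partial counts are merged afterwards.

```python
from collections import Counter

def distance(friends, centroid):
    """Squared Euclidean distance between a 0/1 follow set and a centroid dict."""
    dot = sum(centroid.get(f, 0.0) for f in friends)
    return len(friends) - 2 * dot + sum(w * w for w in centroid.values())

def map_chunk(chunk, centroids):
    """Map step: assign every user in one chunk to the nearest centroid and
    return, per group, a partial (member_count, follow_counts) pair."""
    partial = [(0, Counter()) for _ in centroids]
    for friends in chunk:
        best = min(range(len(centroids)), key=lambda i: distance(friends, centroids[i]))
        count, follows = partial[best]
        follows.update(friends)
        partial[best] = (count + 1, follows)
    return partial

def reduce_partials(partials, centroids):
    """Reduce step: merge the partial counts from all chunks into new centroids."""
    new_centroids = []
    for i, old in enumerate(centroids):
        total = sum(p[i][0] for p in partials)
        follows = Counter()
        for p in partials:
            follows.update(p[i][1])
        # Keep the old centroid if no user was assigned to this group.
        new_centroids.append({f: c / total for f, c in follows.items()} if total else old)
    return new_centroids

def iterate(chunks, centroids):
    # Each map_chunk call only reads its own chunk, so on a cluster every
    # call could run on a different node; here they simply run in sequence.
    partials = [map_chunk(chunk, centroids) for chunk in chunks]
    return reduce_partials(partials, centroids)

# Made-up data: two chunks of users and two starting centroids.
chunks = [
    [{"php", "wordpress"}, {"nasa", "esa"}],
    [{"php", "gsoc"}, {"nasa", "spacex"}],
]
centroids = iterate(chunks, [{"php": 1.0}, {"nasa": 1.0}])
```

Only the small per-group counts travel between the two steps, not the users themselves, which is what would keep the merge cheap on a large installation.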

The next steps for this week are basically:

  • Move phpcluster to a PHP4 version and integrate it with the system.
  • Ping my mentor to discuss the off-line computation: how it would be handled and integrated with the system.

The project has a public git repository; for now it has just the skeleton, and during the week I’ll upload my tests.

Best regards,