Kaggle just concluded its Personalized Ranking Competition hosted by Expedia. Here is a link to the competition.
This competition is a standout because the dataset provided is considerably bigger than in their earlier competitions (about 10M x 30), and the target classes are unbalanced, with about 90% of the provided data belonging to a single class. Any straightforward learning algorithm runs into several issues:
- Every new feature increases the size of the training data significantly, so even with a modest number of features the data becomes too big to fit in main memory
- A naive cross-validation produces a misleadingly high validation score because of the unbalanced classes, and the resulting model fails to generalize
Of course, these problems are solvable, and the approach we tried had a few components:
- Carefully choose the train-validation split so that all classes keep roughly the same proportions. More importantly, keep all data points from the same user entirely in either the training set or the validation set
- A one-vs-all classification approach was found to do better than multiclass classification. Even then, ordinal classification, which trains one binary classifier per ordered threshold, seemed to perform better still. Here is a paper discussing this approach
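The group-aware split described above can be sketched with scikit-learn's `GroupShuffleSplit`, which holds out whole groups (here, users) rather than individual rows. The data below is a toy stand-in, not the competition data; note that `GroupShuffleSplit` does not stratify by class, so class proportions would still need a separate check (newer scikit-learn versions offer `StratifiedGroupKFold` for that).

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the competition data: 12 rows from 4 users.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 2, size=12)
users = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Hold out ~25% of *users*, not rows, so every row from a given
# user lands entirely in train or entirely in validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, valid_idx = next(splitter.split(X, y, groups=users))

# No user appears on both sides of the split.
assert set(users[train_idx]).isdisjoint(set(users[valid_idx]))
```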
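One common way to reduce ordinal classification to binary problems (in the spirit of the paper mentioned above) is to train one classifier per threshold estimating P(y > k), then recover per-class probabilities by differencing. The sketch below uses logistic regression on synthetic data purely for illustration; the actual model and features used in the competition were different.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic ordinal targets 0 < 1 < 2 (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

# One binary model per threshold: models[k] estimates P(y > k).
models = [LogisticRegression().fit(X, (y > k).astype(int)) for k in (0, 1)]
p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# P(y=0) = 1 - P(y>0); P(y=1) = P(y>0) - P(y>1); P(y=2) = P(y>1).
probs = np.column_stack([1 - p_gt[:, 0], p_gt[:, 0] - p_gt[:, 1], p_gt[:, 1]])
pred = probs.argmax(axis=1)
```

By construction the three recovered probabilities sum to 1 for every row, though the middle term can go slightly negative if the two threshold models disagree.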
We used standard gradient boosting for training and obtained an nDCG score of about 0.506. Not a terribly good score, but comfortably above the benchmark and within the top 16% of all contenders. So, no complaints!
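For reference, the nDCG metric that produced the score above can be computed as below. This is a generic sketch of the standard formula, with the discounted gain summed over a single ranked list and normalized by the ideal ordering; the exact relevance grades and truncation depth used by the competition's scorer may differ.

```python
import numpy as np

def dcg(rels):
    # DCG = sum over positions i of (2^rel - 1) / log2(i + 2)
    rels = np.asarray(rels, dtype=float)
    return np.sum((2 ** rels - 1) / np.log2(np.arange(len(rels)) + 2))

def ndcg(rels_in_predicted_order):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(rels_in_predicted_order, reverse=True))
    return dcg(rels_in_predicted_order) / ideal if ideal > 0 else 0.0

# Hypothetical single search, relevance grades assumed for illustration.
print(round(ndcg([0, 5, 1, 0]), 3))
```

A perfectly ordered list scores 1.0; the competition score is the average of this quantity over all searches in the test set.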