Today marks the end of Kaggle's MarineExplore Whale Detection challenge. The challenge, simply stated, is this: you are given a set of 2-second .aiff sound files, some containing sounds from a particular species of whale and others containing other ambient noise from the sea (possibly including sounds from different species of whale). The dataset consists of a training set with 0/1 labels (30000 samples) and an unlabelled test set (54503 samples). The challenge was to predict the presence of the relevant species of whale in the test set. Like many, my initial approach was to read the aiff files and directly use sound frequencies from the file as features. This approach helps 'break into' the 0.90 AUC (Area Under the Curve) score range. Some of the most successful submissions, however, treat the problem as an image-processing problem, using the audio spectrogram as the relevant feature. Check this forum for more information on these approaches. Using this approach, I obtained an AUC of 0.96016, good for a respectable 56th place out of 249 participants. This gives me a (sorta) coveted Top 25% badge on Kaggle. Click here to check out my code on GitHub.
I recently participated in a basic Kaggle competition arranged by the folks at Scikit. Here is a link to the competition: http://www.kaggle.com/c/data-science-london-scikit-learn My biggest takeaway from it is the IPython notebook, a cool tool (like an R notebook) for running and documenting your data analysis in the browser. Here is my first IPython notebook: http://nbviewer.ipython.org/url/dl.dropbox.com/u/69791784/ipython%2520notebooks/Expository%2520Analysis.ipynb.
I have been playing with Python's machine learning/big data packages and must say that they give R quite a run for its money! For now, I can offer a step-by-step guide for installing these packages on Mac OS X. Click here. Finally, here is the main page for sklearn and the amazing things it can do. Go like!
Nothing much to report, except for an exciting new course announced on Coursera: Startup Engineering.
I remember using C++ pthread_mutexes in the ancient past (well, during my undergraduate years). That was my entire exposure to multithreading. Well, that and a little of Java's Thread. That was until last week. Like everything else, C++ multithreading has been given a boost (pun, eh) by the Boost library. Here, in a nutshell, is how it works:
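As a rough illustration of the thread-plus-mutex idea, here is a minimal sketch. It is an assumption-laden stand-in, not the code from the original post: it uses the C++11 standard library (`std::thread`, `std::mutex`, `std::lock_guard`), which closely mirrors Boost.Thread (`boost::thread`, `boost::mutex`, `boost::lock_guard`), so that it compiles without Boost installed. The function name `count_with_threads` and its parameters are hypothetical, made up for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): starts `num_threads`
// threads, each of which increments a shared counter
// `increments_per_thread` times, and returns the final counter value.
long count_with_threads(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);

    long counter = 0;
    std::mutex counter_mutex;  // plays the role pthread_mutex_t used to play

    // Each thread runs this lambda. The lock_guard locks the mutex on
    // construction and unlocks it when it goes out of scope.
    auto worker = [&counter, &counter_mutex, increments_per_thread]() {
        for (int i = 0; i < increments_per_thread; ++i) {
            std::lock_guard<std::mutex> lock(counter_mutex);
            ++counter;
        }
    };

    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker);  // the thread starts running immediately
    }
    for (std::thread& th : threads) {
        th.join();  // wait for every worker before reading the counter
    }
    return counter;
}
```

Calling `count_with_threads(4, 10000)` should give 40000 on every run, because the mutex serializes the increments. Without the lock, the threads would race on `counter` and the total could come up short. On Linux this typically needs the `-pthread` compiler flag.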
Survived a big storm and a bout of cough/cold after that. In the meantime, learned a thing or two about the factory pattern. The best resource I can give right now is this stackoverflow post.
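For a quick feel of what the factory pattern looks like, here is a small sketch in C++. The language choice and every name in it (`Shape`, `Circle`, `Square`, `make_shape`) are assumptions made up for illustration, not taken from the linked post. The idea is that callers ask a single function for an object by name and only ever see the base-class interface.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical base interface; callers only ever see this type.
class Shape {
public:
    virtual ~Shape() = default;
    virtual std::string name() const = 0;
    virtual double area() const = 0;
};

class Circle : public Shape {
public:
    explicit Circle(double radius) : radius_(radius) {}
    std::string name() const override { return "circle"; }
    double area() const override { return 3.14159265358979323846 * radius_ * radius_; }
private:
    double radius_;
};

class Square : public Shape {
public:
    explicit Square(double side) : side_(side) {}
    std::string name() const override { return "square"; }
    double area() const override { return side_ * side_; }
private:
    double side_;
};

// The factory function: the one place that knows which concrete classes
// exist. Throws std::invalid_argument for an unknown kind.
std::unique_ptr<Shape> make_shape(const std::string& kind, double size) {
    assert(size >= 0.0);
    if (kind == "circle") return std::unique_ptr<Shape>(new Circle(size));
    if (kind == "square") return std::unique_ptr<Shape>(new Square(size));
    throw std::invalid_argument("unknown shape kind: " + kind);
}
```

With this in place, `make_shape("square", 2.0)->area()` evaluates to 4.0, and adding a new shape means touching only the factory function rather than every call site.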
I am currently digging for resources on 'big data with R'. Here are a few that I have found: