Wednesday, 27 November 2013

Did the kaggle titanic competition with vw. I just put in all the features: the name as a bag of words and the others as attribute==X. Doing 1 pass gave the best results (111 on the leaderboard); 2, 5, and 10 passes gave progressively worse results (but only slightly - all at 172 on the leaderboard).
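A rough sketch of the formatting step described above, in python. The column names (Name, Sex, Pclass, Age, Survived, PassengerId) are the ones in the kaggle Titanic csv, but the helper names, the two namespaces and the -1/1 label mapping are my assumptions here, not necessarily what the original script did.

```python
import re

def vw_token(text):
    """Make a string safe as a VW feature name (no spaces, '|' or ':')."""
    return re.sub(r"[\s|:]+", "_", str(text).strip())

def titanic_row_to_vw(row, label=None):
    """Format one passenger as a VW line: name as words, the rest as attribute==value."""
    # Assumption: labels are mapped to -1/1, which is what VW's logistic loss expects.
    head = "" if label is None else ("1" if int(label) == 1 else "-1")
    # Name becomes a bag of lowercase words.
    words = [vw_token(w) for w in re.findall(r"[A-Za-z]+", row.get("Name", "").lower())]
    # Every other non-empty column becomes a single attribute==value feature.
    attrs = [
        f"{vw_token(k)}=={vw_token(v)}"
        for k, v in row.items()
        if k not in ("Name", "Survived", "PassengerId") and v not in ("", None)
    ]
    return f"{head} |name {' '.join(words)} |attr {' '.join(attrs)}".strip()

example = {"Name": "Braund, Mr. Owen Harris", "Sex": "male", "Pclass": "3", "Age": "22"}
line = titanic_row_to_vw(example, label=0)
print(line)
```

Treating numbers like Age as attribute==value makes each distinct value its own feature, which is crude but needs no feature engineering.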

My new github where I publish kaggle code

https://github.com/umarnawaz

Tuesday, 26 November 2013

Entered a few Kaggle Competitions

I entered a few Kaggle competitions for fun and so I could put them on my resume.

On all of them I scored somewhat below the middle of the leaderboard.

I did very little feature engineering. I simply loaded the data into postgres (except for digit recognition, which I just formatted for vw using python), output it to text files in vw format, and used some shell scripts to unix paste the files together and run vw on them. VW automatically treats text as a bag of words, so it gave reasonable results as is. Overfitting exists - running 1000 passes appeared to give better results at the vw console but gave worse results on the kaggle submission.

For see click predict I first did a multiclass --oaa run, with some feature engineering on the timestamps in postgres (date truncs and date parts), and took the ceiling of the lat/longs. The competition had only 4 days left and this was my first competition, so some time was spent learning the kaggle site and the evaluation metric (log). I then redid the competition as a regression on the logarithm of the outputs, but got worse results than with the multiclass.
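The same kind of features can be sketched in python instead of postgres. This is only an illustration: the timestamp format, rounding to whole degrees, and using log(1 + x) for the regression target are all assumptions of mine, and the function names are made up.

```python
import math
from datetime import datetime

def time_features(ts):
    """Date parts and a truncation, roughly what date_part/date_trunc give in SQL."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")  # assumed timestamp format
    return {
        "hour": dt.hour,
        "dow": dt.weekday(),  # Monday = 0 here; postgres date_part('dow', ...) has Sunday = 0
        "month": dt.month,
        "day_trunc": dt.strftime("%Y-%m-%d"),  # like date_trunc('day', ...)
    }

def coarse_location(lat, lon):
    """Take the ceiling of the coordinates so nearby points share a feature."""
    # Assumption: whole degrees; the precision actually used is not stated.
    return math.ceil(lat), math.ceil(lon)

def to_log_target(count):
    """Regression target on a log scale; log1p keeps a count of 0 well defined."""
    return math.log1p(count)

def from_log_prediction(pred):
    """Undo the log transform and clip at 0, since counts cannot be negative."""
    return max(0.0, math.expm1(pred))

feats = time_features("2013-11-26 14:30:00")
cell = coarse_location(37.77, -122.42)
print(feats, cell, to_log_target(0))
```

For the multiclass version the counts would be used as class labels directly; the log transform only matters for the regression attempt.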

The competitions: Partly Sunny with a Chance of Hashtags, digit recognition, See Click Predict Fix.

Monday, 18 November 2013

A pretty good vowpal wabbit tutorial

Someone wrote a pretty good vowpal wabbit tutorial http://spiderspace.wordpress.com/2013/08/22/vowpal-wabbit-tutorial-for-the-uninitiated/

Vowpal Wabbit on MNIST

Running vowpal wabbit on the digit recognition challenge of kaggle with 1-100 passes and logistic loss gives 88-90% accuracy. Using quadratic and cubic features appears to give no improvement. There was a thread on kaggle saying 1000 passes and quadratic features give 97%, so I'll be checking that out later. I'll also be trying lda from vw to see if that gives improvements.
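A minimal sketch of formatting the digit data for vw in python. One thing to watch: vw's --oaa expects class labels 1..k, so the digits 0-9 are shifted to 1-10. The function names, the "px" namespace, scaling pixels to 0..1 and dropping zero pixels are my choices for illustration.

```python
def digit_row_to_vw(label, pixels):
    """Format one image as a sparse VW multiclass line."""
    vw_label = int(label) + 1  # --oaa labels start at 1, digits start at 0
    # Keep only non-zero pixels and scale 0..255 down to 0..1.
    feats = [f"p{i}:{v / 255:.3f}" for i, v in enumerate(pixels) if v]
    return f"{vw_label} |px {' '.join(feats)}"

def vw_label_to_digit(vw_label):
    """Map a VW prediction (e.g. '10' or '10.000000') back to the digit 0-9."""
    return int(float(vw_label)) - 1

line = digit_row_to_vw(9, [0, 255, 0, 128])
print(line)
# Training would then be something like: vw --oaa 10 --loss_function logistic train.vw
```

The predictions need the reverse shift before they go into the submission file, which is what vw_label_to_digit does.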

Saturday, 16 November 2013

Competitions Kaggle should run

A list of competitions kaggle should run. They could be financed by crowdfunding (on Kickstarter or Indiegogo).

Wikipedia - given some number of words, 100 let's say, predict the title

Jeopardy - same as IBM Watson ran

Image recognition - use something like Kaggle or 1000,000 categories of images from flickr, predict the category

Conversation Turing test - take Wikipedia articles or chat conversations, chop them up, and predict the next word or the remainder of the sentence.

Law - given text from both sides of trials, predict the verdict




Thursday, 14 November 2013

How I approach machine learning

I have a simplified conception of machine learning with a few basic algorithms

decision trees and ensembles of them (boosting/random forests) - I haven't looked much into these but might test them in the future (probably just use waffles)

perceptron (linear regression etc.) - there is only 1 global minimum, so the derivative is useful. I just use vowpal wabbit for it (get as much data and engineer as many features as possible and let vw sort it out)

neural nets - perceptrons wired together - many local minima, so the derivative may or may not be useful. Academic researchers try out different optimizers (sgd is the most common). I would think simulated annealing would do well on it. I tried the vowpal wabbit nnet but it took too long to run and gave poor results (researchers use GPUs to get performance).

autoencoders, topic models - unsupervised learning. I just use gensim's implementations. It has tfidf, lsi, rp, lda, hdp. I've only tested out tfidf, lsi, and rp, and might only use rp in the future.

naive bayes - count stuff up.
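To show what "count stuff up" means for naive bayes, here is a small python sketch. It assumes a bag-of-words model with add-one smoothing; the toy documents and function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: iterable of (label, list_of_words). Training is just counting."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict_nb(model, words):
    """Pick the label with the highest log prior plus summed log word likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            # Add-one smoothing so unseen words do not zero out the probability.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("sports", ["goal", "match", "team"]),
    ("sports", ["team", "win"]),
    ("tech", ["code", "python", "team"]),
    ("tech", ["python", "database"]),
]
model = train_nb(docs)
print(predict_nb(model, ["python", "code"]))
```

Since the model is nothing but counts, the same thing can be done with GROUP BY in sql or with associative arrays in awk.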

For my recommender systems projects I plan on sticking to only the vector spaces (gensim), perceptron (vw), and naive bayes (probably do it in sql or awk). I don't have much computer power so things like nnets are too much for now.

Software I use for recommender systems

I use python, haskell, postgresql, emacs, gensim, vowpal wabbit, and debian (I don't know much about other tools like R/matlab that would be good for recommender systems)

Potentially useful -
python libraries are - numpy, scipy, scikit-learn, pandas, gensim
unix tools are - awk, sed, vowpal wabbit, waffles machine learning
sql tools are - postgres, sqlite

In open source databases there is pytables, postgres, sqlite, index db's like tokyo, leveldb. Out of these I just use postgres (it's fast enough - index db's and pytables probably aren't worth the difference).

Of machine learning libraries I use gensim and vowpal wabbit. Both are on-line (use little memory) and fast. The others are good and I've tested them a bit (but my desktop only has 500mb of memory).

In languages I use python, awk, sql, and haskell (depending on my mood - they're all really fun to work in). I use awk/wc/grep for simple searches/counts of files, python or haskell to do more complex extractions. SQL is best for the most complex querying and I'm trying to build up a whole bunch of tables for auto-tagging text.

Data sets on my hard drive

I sometimes download data onto my computer for fun or as a possible entrepreneurial opportunity (sometimes data is withdrawn later, so it's best to get a personal copy).

AOL search data they released a few years ago (248mb)

music brainz (4.2gb)

Census data
  US Business Census data probably from Bureau of Labor Statistics
    establishments, employees by NAICS classification (3.3mb)
  some other BLS (531mb)
  some US government employee stats (2.7mb)
  Canada census data (321mb)
  UK census partly (202kb)
  household expenditures (96kb)
  retail survey (489kb)
  susb naics (2mb)
  UN world occupation data (49mb)
 
Time Magazine covers (253mb)

USDA nutrients (54mb)

ONET Skills from US Labor Bureau - data about job types and their duties (31mb)

guardian data sets csv (416k)

harvard library metadata (2.3gb)

Computer vision data
  CIFAR 100 (200mb)
  imagenet urls (338mb)
  Faces dataset (from somewhere) (553mb)
  mirflickr 1 million images (48gb)
  mirflickr 25 (3gb)
  poselets (1gb)
  trecvid (4.8mb)
  berkeley 3d kinect (800mb)
  caltech256 (1.7gb)
  microsoft kinect gestures (541mb)
  pascal vision challenge (2.5gb)

Corp Watch (72kb)

various kinds of finance data (1gb)
  daily summaries from NASDAQ, AMEX, NYSE
  prices pulled from yahoo api
  kenneth french research
  robert shiller data
  sec data

DBpedia (4.4gb)

dmoz (400mb)

Freebase (38gb)

Movie/tv data sets
  IMDB (1.2gb)
  netflix prize (700mb)
  tvtropes (1gb)

various kaggle data sets (8.7gb)

sherlock holmes stories (14mb)

from google
  wikilinks (1.8gb)
  wikipedia crosslinks (8.5gb)

wikipedia page counts (112gb)
wikipedia (10gb)