Recsys 2013: Yelp! Business Prediction Contest

I got an interesting email from Prof. Nicholas Ampazis of the University of the Aegean, Greece. Nicholas is trying out GraphChi gensgd for Kaggle's Yelp! business prediction contest, which is part of Recsys 2013.

First he sent me some interesting observations about the dataset:


- There are 2108 training users in the ratings (review) matrix that do not appear in the training users file. The reverse is not true (i.e. all users in the training user file have ratings).
- All business_ids in review appear in the business file
- There are 5315 users for which we wish to make predictions that do not appear in the ratings matrix.
- There are 1205 business_ids for which we wish to make predictions that do not appear in the ratings matrix (those always come in pairs with the unknown users above).
- The union of (distinct) users in the ratings matrix, training user file and test user file is 51082
- The union of (distinct) business_ids in the ratings matrix, training business file and test business file is 12742
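
Counts like these can be reproduced with a few set operations. The following is a minimal sketch, not the code Nicholas used; it assumes each Yelp file holds one JSON object per line with a user_id field, and the helper names read_ids and compare_ids are hypothetical. Toy data stands in for the real files.

```python
import json

def read_ids(lines, key):
    """Collect the distinct values of `key` from JSON-lines input."""
    return {json.loads(line)[key] for line in lines if line.strip()}

def compare_ids(review_ids, profile_ids):
    """Return (ids only in reviews, ids only in profiles, size of the union)."""
    return (review_ids - profile_ids,
            profile_ids - review_ids,
            len(review_ids | profile_ids))

# Toy data standing in for the review and user files (not contest data).
reviews = ['{"user_id": "u1", "business_id": "b1", "stars": 4}',
           '{"user_id": "u2", "business_id": "b1", "stars": 2}']
users = ['{"user_id": "u1", "review_count": 7}']

only_reviews, only_profiles, total = compare_ids(read_ids(reviews, "user_id"),
                                                 read_ids(users, "user_id"))
print(len(only_reviews), len(only_profiles), total)
```

With the real data, an open file handle can be passed instead of the toy lists, for example read_ids(open("yelp_training_set/yelp_training_set_review.json"), "user_id"), and the same helpers work for business_id.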


Nicholas has kindly agreed to share with us some of the scripts he is using to convert the Yelp! data to GraphChi format (written by his colleague Vaggelis Tripolitakis - thanks!!).

Disclaimer: we have not fine-tuned the performance of gensgd yet, so prediction quality is still poor. We plan to refine execution in the next couple of days and report results here.

0) Register for the competition here and download the datasets into your root GraphChi folder.

Method A: use Vaggelis's scripts (Ruby)
1) Download the conversion scripts from GitHub:
https://github.com/vtripolitakis/yelpscripts
2) Give the script execute permission:
# chmod a+rx script

3) Make sure the json Ruby library is installed:
# sudo gem install json

Note: if you do not have root permission on your machine, install the package using
# gem install json
and add the locally created gem folder to your path, for example:
# export PATH=$PATH:/home/bickson/.gem/ruby/1.8/bin

Method B: use Justin Yan's scripts (Python):

Preliminaries: you will have to install the Python pandas library.
This script is based on a script by Paul Butler.

Create a file named conv_json2csv.py with the following lines:

'''
Convert Yelp Academic Dataset from JSON to CSV

'''

import json
import pandas as pd
from glob import glob

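def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    # Iterate over a snapshot of the items: the loop adds and deletes keys.
    for k, v in list(ob.items()):
        if isinstance(v, list):
            # Flatten lists (e.g. categories) into one comma-separated string.
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            # Promote nested fields (e.g. votes) to top-level key_subkey columns.
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

for json_filename in glob('*.json'):
    csv_filename = '%s.csv' % json_filename[:-5]
    print('Converting %s to %s' % (json_filename, csv_filename))
    with open(json_filename) as f:
        # Replace escaped newlines so that each record stays on one CSV line.
        df = pd.DataFrame([convert(line.strip().replace("\\n", " ").replace("\\r", " ")) for line in f])
    df.to_csv(csv_filename, encoding='utf-8', index=False)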

Run
# python conv_json2csv.py

4) Use the following commands (they call the Method A script) to convert the data to GraphChi format
(hint: use copy & paste!)

###################### TRAINING SET ##########################

#---REVIEW---
./script yelp_training_set/yelp_training_set_review.json user_id business_id date votes stars > yelp_training_set_review.csv

#----USER---
./script yelp_training_set/yelp_training_set_user.json user_id review_count average_stars name votes > yelp_training_set_user.csv

#----BUSINESS----
./script yelp_training_set/yelp_training_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address stars > yelp_training_set_business.csv

##############################################################


###################### TEST SET ##########################

#---REVIEW---
./script yelp_test_set/yelp_test_set_review.json user_id business_id > yelp_test_set_review.csv

#----USER---
./script yelp_test_set/yelp_test_set_user.json user_id review_count > yelp_test_set_user.csv

#----BUSINESS----
./script yelp_test_set/yelp_test_set_business.json business_id open city state review_count longitude latitude categories name neighborhoods full_address > yelp_test_set_business.csv

##############################################################
######### CONCATENATE USER/BUSINESS FILES FROM TRAIN AND TEST ##########################

cat yelp_training_set_user.csv yelp_test_set_user.csv > user_file.csv

cat yelp_training_set_business.csv yelp_test_set_business.csv > business_file.csv

5) Run GraphChi GENSGD
a) Prepare a file named yelp_training_set_review.csv\:info with the following 2 lines (the second line gives the number of users, the number of businesses, and the number of ratings):
%%MatrixMarket matrix coordinate real general
51082 12742 229907
And a second file named yelp_test_set_review.csv\:info with the following 2 lines:

%%MatrixMarket matrix coordinate real general
51082 12742 22956
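
If you prefer not to create the info files by hand, they can be generated with a few lines of Python. This is a minimal sketch: the helper name write_info is hypothetical, and it assumes a file system that allows a colon in file names (as Linux does). The demo writes into a temporary folder using the test set sizes given above.

```python
import os
import tempfile

def write_info(data_file, n_users, n_items, n_ratings):
    """Write the two-line Matrix Market size header as `<data_file>:info`."""
    info_path = data_file + ":info"
    with open(info_path, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        f.write("%d %d %d\n" % (n_users, n_items, n_ratings))
    return info_path

# Demo in a temporary folder, so nothing in the GraphChi folder is touched.
demo_dir = tempfile.mkdtemp()
path = write_info(os.path.join(demo_dir, "yelp_test_set_review.csv"),
                  51082, 12742, 22956)
with open(path) as f:
    header = f.read()
print(header)
```

To create the real files, call write_info with the data file names from inside your GraphChi folder instead of the temporary path.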


b) First trial: run using reviews only (without user and business information)

[email protected]:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=2 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=3 --minval=1 --maxval=5  --clean_cache=1
WARNING:  common.hpp(print_copyright:180): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to [email protected] 
[training] => [yelp_training_set_review.csv]
[test] => [yelp_test_set_review.csv]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]
[rehash] => [1]
[gensgd_mult_dec] => [0.999999]
[quiet] => [1]
[file_columns] => [3]
[minval] => [1]
[maxval] => [5]
[clean_cache] => [1]

...
   2.60862) Iteration:   0 Training RMSE:    1.22329
   3.39838) Iteration:   1 Training RMSE:    1.18201
    4.2365) Iteration:   2 Training RMSE:    1.16143
   5.04867) Iteration:   3 Training RMSE:    1.14613
   5.89126) Iteration:   4 Training RMSE:    1.13354
   6.70683) Iteration:   5 Training RMSE:     1.1225
Found 2466 new test users with no information about them in training dataset!

c) Second run: throw in user information

[email protected]:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=2 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=3 --minval=1 --maxval=5 --user_file=user_file.csv  --clean_cache=1
WARNING:  common.hpp(print_copyright:180): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to [email protected] 
[training] => [yelp_training_set_review.csv]
[test] => [yelp_test_set_review.csv]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]
[rehash] => [1]
[gensgd_mult_dec] => [0.999999]
[quiet] => [1]
[file_columns] => [3]
[minval] => [1]
[maxval] => [5]
[user_file] => [user_file.csv]
[clean_cache] => [1]
...
   3.14781) Iteration:   0 Training RMSE:    1.21868
   4.17876) Iteration:   1 Training RMSE:    1.10707
   5.20784) Iteration:   2 Training RMSE:    1.05591
   6.28441) Iteration:   3 Training RMSE:    1.01406
   7.31922) Iteration:   4 Training RMSE:   0.975489
   8.33992) Iteration:   5 Training RMSE:   0.939978
Found 2466 new test users with no information about them in training dataset!

d) Third run: also throw in business information


[email protected]:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=yelp_training_set_review.csv --test=yelp_test_set_review.csv --from_pos=0 --to_pos=1 --val_pos=2 --rehash=1 --gensgd_mult_dec=0.999999 --quiet=1 --file_columns=3 --minval=1 --maxval=5 --user_file=user_file.csv --item_file=business_file.csv --clean_cache=1
WARNING:  common.hpp(print_copyright:180): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickso[email protected] 
[training] => [yelp_training_set_review.csv]
[test] => [yelp_test_set_review.csv]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]
[rehash] => [1]
[gensgd_mult_dec] => [0.999999]
[quiet] => [1]
[file_columns] => [3]
[minval] => [1]
[maxval] => [5]
[user_file] => [user_file.csv]
[item_file] => [business_file.csv]
[clean_cache] => [1]
...
   3.62809) Iteration:   0 Training RMSE:    1.29575
   5.10944) Iteration:   1 Training RMSE:    1.06187
   6.50959) Iteration:   2 Training RMSE:   0.995394
   7.92686) Iteration:   3 Training RMSE:   0.947596
   9.35034) Iteration:   4 Training RMSE:   0.906372
   10.7604) Iteration:   5 Training RMSE:    0.86826
Found 2466 new test users with no information about them in training dataset!

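Conclusion: including user and business properties lowers the training RMSE after six iterations from 1.12 (reviews only) to 0.94 (with user information) and 0.87 (with user and business information). Note that these are training errors; they do not by themselves tell us how well the model predicts the test set.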
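
For reference, the RMSE reported in the logs is the square root of the mean squared difference between predicted and actual star ratings. A minimal sketch of the computation, with made-up numbers rather than contest data (the function name rmse is my own, not part of GraphChi):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Errors are -1, 0 and 2 stars, so the mean squared error is 5/3.
example = rmse([3.0, 4.0, 5.0], [4.0, 4.0, 3.0])
print(round(example, 4))
```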


The output of gensgd is the file yelp_test_set_review.csv.predict

%%MatrixMarket (null)
22956 1
   1.4793704
N/A
   3.3002301
   2.8208445
   4.0713396
N/A
   3.1468302
   3.6243955
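
Before building a submission, the prediction file has to be read back, including the N/A lines. The following is a minimal sketch that assumes the layout shown above (a %% banner line, a size line, then one value or N/A per test row). The helper names are hypothetical, and the fallback value 3.5 is an arbitrary placeholder, not a recommendation.

```python
def read_predictions(lines):
    """Parse .predict content; returns (declared row count, values with None for N/A)."""
    rows = [line.strip() for line in lines if line.strip()]
    # Drop the %%MatrixMarket banner; the next line is the size line.
    body = [r for r in rows if not r.startswith("%")]
    declared = int(body[0].split()[0])
    values = [None if r == "N/A" else float(r) for r in body[1:]]
    return declared, values

def fill_missing(values, default):
    """Replace missing predictions with a fallback rating."""
    return [default if v is None else v for v in values]

# A shortened sample in the same layout as the file above.
sample = """%%MatrixMarket (null)
3 1
   1.4793704
N/A
   3.3002301
""".splitlines()

declared, values = read_predictions(sample)
filled = fill_missing(values, 3.5)
print(declared, filled)
```

For the real file, pass open("yelp_test_set_review.csv.predict") to read_predictions and check that the declared row count matches the number of values read.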


Next: I will soon post an update about the performance of GraphChi and how to create the submission format from the GraphChi output.
This article was posted in Uncategorized by Danny Bickson.