Changes from 41 commits
70 commits
75726ce
First Commit - Some refactoring of setup code and SVDPP implementatio…
Sep 20, 2013
f33479c
First Commit - Some refactoring of setup code and SVDPP implementatio…
Sep 20, 2013
558482c
SVDPP fixes - still not giving RMSE as cpp version for test set
Sep 22, 2013
15a5a46
First version of Bayesian Probabilistic Matrix Factorization - not te…
Sep 22, 2013
bbaad36
Wrong PMF - Need to implement dynamic edge values?
Oct 2, 2013
e71eeee
Working PMF and SVDPP - RMSE not as low as CPP version
Oct 2, 2013
d784996
Working versions of SVDPP and PMF - only evaluated using training RMSE
Oct 3, 2013
bb3565b
Adding comment about difference with C++ implementation
Oct 3, 2013
1f25ecc
Cleaning imports
Oct 4, 2013
8eb9ad9
Incomplete implementation of LibFM_MCMC
Oct 7, 2013
d0ec864
Merge branch 'metrics' into first-branch
Oct 8, 2013
9cf62fe
Refactoring SVDPP - Use HugeDoubleMatrix instead of individual objects
Oct 8, 2013
13f626c
Refactoring code
Oct 13, 2013
ef5a105
Refactoring ALS and PMF to use efficient data structure for params an…
Oct 13, 2013
b08b520
BiasSgd framework
sam9595 Oct 15, 2013
6ced3f3
Incorrect implementation of LibFM, useful things in there to correct …
Oct 19, 2013
afa50d3
Lot of random changes, LibFM SGD implementation
Oct 22, 2013
e117013
Add BiasSgd
sam9595 Oct 22, 2013
db8219f
Merge branch 'first-branch' of https://github.com/MohtaMayank/graphch…
sam9595 Oct 22, 2013
08e3b46
temp commit
sam9595 Oct 22, 2013
fe9fb8e
commit trial
sam9595 Oct 22, 2013
8fe5271
First commit for generic Data API for rec systems
Oct 24, 2013
252ed2d
Standard framework for recommender systems - Needs some improvement a…
Oct 26, 2013
552723c
More refactoring and comments - LibFM needs debugging
Oct 27, 2013
e9e1919
Merge branch 'first-branch' of https://github.com/MohtaMayank/graphch…
sam9595 Oct 27, 2013
d19b350
modify biasSgd
sam9595 Oct 28, 2013
7388ac1
Implementing code for validation
Oct 30, 2013
a3121d9
Some minor cleanup
Oct 30, 2013
d520eb6
Merging recommender system framework
Oct 30, 2013
2d1c58f
Fixing SVDPP bug
Oct 30, 2013
1efc5bf
modify some ALS lines based on Aapo's suggestions
sam9595 Nov 2, 2013
0df95f5
modify some ALS lines based on Aapo's suggestions
sam9595 Nov 2, 2013
0090c72
lastFM converter and biasSgd setting
sam9595 Nov 4, 2013
03915ab
Merge branch 'first-branch'
Nov 5, 2013
b959e33
Naive Yarn Scheduler
Nov 12, 2013
525d317
Improvements and implementation of RecommederScheduler and Recommende…
Nov 14, 2013
d8604b0
Merge branch 'master' into rec-yarn
Nov 14, 2013
2feb4e9
Working on single node YARN with HDFS. Still some problems with multi…
Nov 18, 2013
65c34e6
Cosmetic changes and comments. Other minor changes
Nov 19, 2013
0473413
Fixing indentation
Nov 19, 2013
f7cabb2
Merge https://github.com/GraphChi/graphchi-java
Nov 19, 2013
0278e5b
Automatic deployment and running of YARN on AWS
Nov 20, 2013
df28228
Adding finishComputation to GraphChiContext. Other improvements in pe…
Nov 22, 2013
5834f69
serialization predictTest
sam9595 Nov 22, 2013
6fa3d14
delete unnecessary comments
sam9595 Nov 22, 2013
27a8fd3
Merging the changes related to serialization of model and prediction
Nov 22, 2013
f00f26f
Ported PMF to new model as well as some changes in RecommenderPool / …
Nov 24, 2013
4fb1070
Adding code for estimating memory usage by graphchi engine
Nov 26, 2013
60d2071
Committing before trying to install new OS
Nov 27, 2013
71deb4d
Fixed parsing parameters, serializtion for all the 5 recommenders.
Nov 27, 2013
2f74ea3
Adding code to serialize model into HDFS and Code to read raw data fr…
Nov 27, 2013
b49343a
Adding missed file to read data from URL / S3
Nov 27, 2013
98157b8
generic_prediction
sam9595 Nov 27, 2013
c3a0fde
merge rec-yarn and rec-yarn-serialize
sam9595 Nov 28, 2013
f35863c
Broken logic for sccheduling
Nov 28, 2013
adafd27
Automatic setup of YARN cluster on AWS
Nov 28, 2013
3a245c7
Merge branch 'rec-yarn' into rec-yarn-serialize
sam9595 Nov 28, 2013
b43d968
Using custom method for building paths instead of Java.nio.Paths
Nov 28, 2013
4128c65
Add Error Measurement Interface
sam9595 Nov 28, 2013
c57605a
Better scheduling logic for YARN
Dec 1, 2013
10ae999
Add yahoo data description and demo model json files
sam9595 Dec 1, 2013
d111685
Merge branch 'rec-yarn' into rec-yarn-serialize
sam9595 Dec 1, 2013
0fb5016
Add functionality to serialize in the middle
sam9595 Dec 1, 2013
687aa07
Adding bias and factor reg for bias sgd, max iterations for all recom…
Dec 2, 2013
e24f682
New Testing class which uses data reader API
Dec 2, 2013
2c781dd
Predicting PMF output with all the samples
Dec 3, 2013
1fcf2bb
README, sample data and some minur improvements
Dec 6, 2013
a0bf4c5
README
Dec 6, 2013
102b41b
YARN README
Dec 7, 2013
61ef118
Add Javadoc comments
sam9595 Dec 9, 2013
73 changes: 43 additions & 30 deletions pom.xml
@@ -14,15 +14,18 @@
<name>Sonatype Nexus Snapshots</name>
<url>https://oss.sonatype.org/content/repositories/snapshots/</url>
<snapshots><enabled>true</enabled></snapshots>

</repository>
<repository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</repository>
<repository>
<id>ApacheReleases</id>
<name>Apache release repository</name>
<url>https://repository.apache.org/content/repositories/releases</url>
</repository>
</repositories>

<dependencies>
<dependency>
<groupId>com.yammer.metrics</groupId>
Expand All @@ -32,35 +35,35 @@

<!-- Scala version is very important. Luckily the plugin warns you if you don't specify:
[WARNING] you don't define org.scala-lang:scala-library as a dependency of the project -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.9.0-1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.6</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<type>jar</type>
<scope>test</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<scope>compile</scope>
<version>0.10.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.9.0-1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.6</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<type>jar</type>
<scope>test</scope>
<optional>true</optional>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.2</version>
<artifactId>hadoop-client</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.pig</groupId>
Author


Need to fix the indentation. It is pretty screwed up here

<artifactId>pig</artifactId>
<scope>compile</scope>
<version>0.12.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math</artifactId>
@@ -76,6 +79,16 @@
<artifactId>commons-cli</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-math3</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>gov.sandia.foundry</groupId>
<artifactId>gov-sandia-cognition-learning-core</artifactId>
<version>3.3.3</version>
</dependency>
</dependencies>

<build>
177 changes: 177 additions & 0 deletions scripts/convert_to_mm.py
@@ -0,0 +1,177 @@
from optparse import OptionParser
Author


This file can be ignored right now. It is specific to parsing a dataset

import csv
import simplejson

DELIM = '\t'

FEATURE_JSON_FORMAT = ("{\n"+
" file_name : <file_location>\n"+
" delim : <delimiter, default = \\t>\n"+
" num : <num_users>\n"+
" delete_cols : [<List of columns to not consider>]\n"+
" multiple_feature_delim : <default = ','>\n"+
" numerical_attr : [<list of numerical attributes>]\n"
"}")

MULTIPLE_FEATURE_DELIM = ','


def parse_command_line():
    parser = OptionParser(usage="python convert_to_mm.py -g=<graph-file> -e=<num_edges>"
                                + " [other optional options]")

    # Information about the graph file
    parser.add_option("-g", "--graph-file", action="store", type="string", dest="graph_file",
                      help="The file containing the graph (<in-edge> <out-edge> <edge-val>)")
    parser.add_option("-e", "--num_edges", action="store", type="int", dest="num_edges",
                      help="Number of edges in the graph file")

    # Information about the user file
    parser.add_option("-u", "--user_file_info", action="store", type="string", dest="user_file_info",
                      help=("JSON string containing required information about the user feature file." +
                            " The format of the JSON is as follows:\n") + FEATURE_JSON_FORMAT)

    # Information about the item file
    parser.add_option("-i", "--item_file_info", action="store", type="string", dest="item_file_info",
                      help=("JSON string containing required information about the item feature file." +
                            " The format of the JSON is as follows:\n") + FEATURE_JSON_FORMAT)

    return parser.parse_args()


def update_vertex_map_from_graph(graph_file_name, user_mapping, item_mapping):
    num_edges = 0
    # Go through the graph file and compute the user and item maps
    uniq_user_counter = len(user_mapping) + 1
    uniq_item_counter = len(item_mapping) + 1

    with open(graph_file_name, 'r') as graph_file:
        reader = csv.reader(graph_file, delimiter=DELIM)
        for row in reader:
            num_edges += 1
            user = user_mapping.get(row[0], None)
            if user is None:
                user_mapping[row[0]] = uniq_user_counter
                uniq_user_counter = uniq_user_counter + 1

            item = item_mapping.get(row[1], None)
            if item is None:
                item_mapping[row[1]] = uniq_item_counter
                uniq_item_counter = uniq_item_counter + 1

    return num_edges

def convert_to_matrix_market(graph_file_name, user_mapping, item_mapping):

    num_edges = update_vertex_map_from_graph(graph_file_name, user_mapping, item_mapping)

    # Counters for any IDs not already present in the mappings
    uniq_user_count = len(user_mapping) + 1
    uniq_item_count = len(item_mapping) + 1

    with open(graph_file_name, 'r') as graph_file:
        out_file = open(graph_file_name + ".mm", 'w')
        out_file.write("%%MatrixMarket matrix coordinate real general\n")
        out_file.write("% Generated on <DATE>\n")
        out_file.write(str(len(user_mapping)) + ' ' + str(len(item_mapping)) + ' ' + str(num_edges) + '\n')

        reader = csv.reader(graph_file, delimiter=DELIM)
        for row in reader:
            user = user_mapping.get(row[0], None)
            if user is None:
                user = uniq_user_count
                user_mapping[row[0]] = user
                uniq_user_count = uniq_user_count + 1

            item = item_mapping.get(row[1], None)
            if item is None:
                item = uniq_item_count
                item_mapping[row[1]] = item
                uniq_item_count = uniq_item_count + 1

            out_file.write(str(user) + ' ' + str(item) + ' ' + row[2] + '\n')

    return {'num_edges': num_edges, 'num_features': 0}


def parse_vertex_features(vertex_mapping, feature_file_info_str):
    feature_file_info = simplejson.loads(feature_file_info_str)

    multiple_feature_delim = feature_file_info.get("multiple_feature_delim", MULTIPLE_FEATURE_DELIM)

    uniq_vertex_count = len(vertex_mapping) + 1
    feature_count = 1
    feature_mapping = {}

    with open(feature_file_info["file_name"], 'r') as feature_file:
        user_out_file = open(feature_file_info["file_name"] + ".conv", 'w')
        reader = csv.reader(feature_file, delimiter=DELIM)
        for row in reader:
            vertex = vertex_mapping.get(row[0], None)

            # If this vertex was not seen in the graph file, assign it a new id
            if vertex is None:
                vertex_mapping[row[0]] = uniq_vertex_count
                uniq_vertex_count += 1

            out_str = str(vertex_mapping[row[0]])

            for i in range(1, len(row)):
                if i in feature_file_info.get("delete_cols", []):
                    continue

                # Add a numerical attribute, encoded as <feature_label>:<value>
                if i in feature_file_info.get("numerical_attr", []):
                    feature_label = feature_mapping.get((i, 0), None)
                    if feature_label is None:
                        feature_mapping[(i, 0)] = feature_count
                        feature_label = feature_count
                        feature_count += 1
                    out_str = out_str + DELIM + str(feature_label) + ":" + row[i]
                    continue

                # Add a categorical attribute: one binary feature per distinct value
                feature_values = row[i].split(multiple_feature_delim)
                for val in feature_values:
                    feature_label = feature_mapping.get((i, val), None)
                    if feature_label is None:
                        feature_mapping[(i, val)] = feature_count
                        feature_label = feature_count
                        feature_count += 1
                    out_str = out_str + DELIM + str(feature_label) + ":1"

            # Write the out_str to the output file
            user_out_file.write(out_str + '\n')

    return {'num_entries': len(vertex_mapping), 'num_features': feature_count}


if __name__ == "__main__":

    (options, args) = parse_command_line()

    user_mapping = {}
    users_info = {}
    if options.user_file_info is not None:
        users_info = parse_vertex_features(user_mapping, options.user_file_info)

    item_mapping = {}
    items_info = {}
    if options.item_file_info is not None:
        items_info = parse_vertex_features(item_mapping, options.item_file_info)

    graph_info = convert_to_matrix_market(options.graph_file, user_mapping, item_mapping)

    with open(options.graph_file + ".info", 'w') as f:
        f.write(
            simplejson.dumps(
                {
                    'num_users': len(user_mapping),
                    'num_user_features': users_info.get('num_features', 0),
                    'num_items': len(item_mapping),
                    'num_item_features': items_info.get('num_features', 0),
                    'num_edge_features': graph_info.get('num_features', 0),
                    'num_edges': graph_info.get('num_edges', 0),
                    'user_mapping': user_mapping,
                    'item_mapping': item_mapping
                }
            )
        )
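For reference, a minimal in-memory sketch of the remapping and Matrix Market output this script produces (hypothetical IDs and ratings; the real script streams TSV files in two passes and also emits a `% Generated on` comment line):

```python
def remap_edges(edges):
    """Map raw user/item ids to 1-based contiguous indices,
    as update_vertex_map_from_graph does for the graph file."""
    user_mapping, item_mapping, triples = {}, {}, []
    for u, i, val in edges:
        user_mapping.setdefault(u, len(user_mapping) + 1)
        item_mapping.setdefault(i, len(item_mapping) + 1)
        triples.append((user_mapping[u], item_mapping[i], val))
    return user_mapping, item_mapping, triples


def to_matrix_market(edges):
    """Render remapped edges in Matrix Market coordinate format:
    header line, then 'rows cols nnz', then one 'user item value' per edge."""
    users, items, triples = remap_edges(edges)
    lines = ["%%MatrixMarket matrix coordinate real general",
             "%d %d %d" % (len(users), len(items), len(triples))]
    lines += ["%d %d %s" % t for t in triples]
    return "\n".join(lines)


print(to_matrix_market([("u7", "i3", "4.0"), ("u2", "i3", "5.0"), ("u7", "i9", "1.0")]))
```

The two-pass structure in the script exists because the size line (`rows cols nnz`) must be written before the edges, so the ID maps and edge count have to be known up front.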

80 changes: 80 additions & 0 deletions scripts/lastFM_user_feature.py
@@ -0,0 +1,80 @@
from optparse import OptionParser
Author


This file can be ignored. Specific to a dataset parsing

import csv
import simplejson

DELIM = '\t'

MULTIPLE_FEATURE_DELIM = ','

month_mapping = {"Jan":1, "Feb":2, "Mar":3, "Apr":4, "May":5, "Jun":6, "Jul":7, "Aug":8, "Sep":9, "Oct":10, "Nov":11, "Dec":12}

def parse_command_line():
    parser = OptionParser(usage="python lastFM_user_feature.py -u=<user-file> -a=<age_bin_interval>"
                                " -y=<date_bin_year> -m=<date_bin_month> -d=<date_bin_day>"
                                " [other optional options]")

    # Information about the user file
    parser.add_option("-u", "--user-file", action="store", type="string", dest="user_file",
                      help="The file containing the user features")

    # Information about the bin segmentation
    parser.add_option("-a", "--age_bin_interval", action="store", type="int", dest="age_interval", default=5,
                      help="The interval of an age bin")

    parser.add_option("-y", "--date_bin_year", action="store", type="int", dest="year_interval", default=0,
                      help="The interval of the date bin in years")

    parser.add_option("-m", "--date_bin_month", action="store", type="int", dest="month_interval", default=0,
                      help="The interval of the date bin in months")

    parser.add_option("-d", "--date_bin_day", action="store", type="int", dest="day_interval", default=0,
                      help="The interval of the date bin in days")

    return parser.parse_args()


def date_key_conversion(date, year_interval, month_interval, day_interval):
    date_format = date.replace(',', ' ').split()
    year = int(date_format[2])
    month = month_mapping[date_format[0]]
    day = int(date_format[1])
    if day_interval != 0:
        key = str(year) + ' ' + str(month) + ' ' + str(day // day_interval)
    elif month_interval != 0:
        key = str(year) + ' ' + str(month // month_interval)
    elif year_interval != 0:
        key = str(year // year_interval)
    else:  # Not specified: each day is an independent bin
        key = str(year) + ' ' + str(month) + ' ' + str(day)
    return key


def age_key_conversion(age, age_interval):
    if age == '':
        return age
    age_numeric = int(age)
    if age_interval != 0:
        key = str(age_numeric // age_interval)
    else:
        key = age
    return key

def parse_user_features(user_feature_file, age_interval, year_interval, month_interval, day_interval):

    with open(user_feature_file, 'r') as feature_file:
        out_name = (user_feature_file + "_age" + str(age_interval) + "_" +
                    str(year_interval) + "y" + str(month_interval) + "m" +
                    str(day_interval) + "d.conv")
        user_out_file = open(out_name, 'w')
        reader = csv.reader(feature_file, delimiter=DELIM)
        for row in reader:
            age_key = age_key_conversion(row[2], age_interval)
            date_key = date_key_conversion(row[4], year_interval, month_interval, day_interval)
            out_str = row[0] + DELIM + row[1] + DELIM + age_key + DELIM + row[3] + DELIM + date_key

            # Write the out_str to the output file
            user_out_file.write(out_str + '\n')


if __name__ == "__main__":

    (options, args) = parse_command_line()

    parse_user_features(options.user_file, options.age_interval, options.year_interval,
                        options.month_interval, options.day_interval)
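The binning logic above can be sketched in isolation (assuming the lastFM signup-date format `"Mon Day, Year"`; coarser intervals collapse more dates into one key):

```python
# Minimal sketch of the date binning used by lastFM_user_feature.py.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def date_bin(date, year_interval=0, month_interval=0, day_interval=0):
    """Collapse a 'Mon Day, Year' string into a coarser bin key.
    The finest non-zero interval wins; with no interval, each day is its own bin."""
    parts = date.replace(',', ' ').split()
    month, day, year = MONTHS[parts[0]], int(parts[1]), int(parts[2])
    if day_interval:
        return "%d %d %d" % (year, month, day // day_interval)
    if month_interval:
        return "%d %d" % (year, month // month_interval)
    if year_interval:
        return str(year // year_interval)
    return "%d %d %d" % (year, month, day)


print(date_bin("Aug 13, 2007", day_interval=7))
```

Because the key is built with integer division, all dates whose truncated quotient matches end up with the same categorical feature after the downstream one-hot encoding.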

21 changes: 21 additions & 0 deletions scripts/movielens_item_features.py
@@ -0,0 +1,21 @@
import sys
Author


Can be ignored

import csv

DELIM = "|"
OUT_DELIM = '\t'

if __name__ == "__main__":

    with open(sys.argv[1], 'r') as user_file:
        out_file = open(sys.argv[1] + ".processed", 'w')
        reader = csv.reader(user_file, delimiter=DELIM)
        for row in reader:
            out_str = row[0] + OUT_DELIM

            # Columns 5 onwards are binary genre flags; keep the indices of the set ones
            for i in range(5, len(row)):
                if row[i] == '1':
                    out_str = out_str + str(i) + ","
            if out_str[-1] == ',':
                out_str = out_str[:-1]

            out_file.write(out_str + '\n')
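The per-row conversion can be sketched as a pure function (hypothetical row data following the MovieLens `u.item` layout: id, title, date, video date, URL, then binary genre flags):

```python
def convert_row(row, flag_start=5):
    """Emit '<item-id>\t<idx,idx,...>' where the indices are the
    columns (from flag_start on) whose genre flag is '1'."""
    flags = [str(i) for i in range(flag_start, len(row)) if row[i] == '1']
    return row[0] + '\t' + ','.join(flags)


print(convert_row(["1", "Toy Story (1995)", "01-Jan-1995", "", "url", "0", "1", "1", "0"]))
```

Joining with `','` avoids the trailing-comma strip the script does by hand, while producing the same output, including the bare `id<TAB>` line for rows with no flags set.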