vinta / albedo

A recommender system for discovering GitHub repos, built with Apache Spark

185 stars, 40 forks, 185 watching, MIT License
Topics: apache-spark, elasticsearch, feature-engineering, machine-learning, python, recommender-system, scala

Albedo
======

A recommender system for discovering GitHub repos, built with Apache Spark.

Albedo is a fictional character in Dan Simmons's Hyperion Cantos series. Councilor Albedo is the TechnoCore's AI advisor to the Hegemony of Man.

Setup

$ git clone https://github.com/vinta/albedo.git
$ cd albedo
$ make up

Collect Data

Create a personal access token on your GitHub settings page. The commands below refer to it as GITHUB_PERSONAL_TOKEN.

# get into the main container
$ make attach

# this step might take a few hours to complete,
# depending on how many repos you have starred and how many users you follow
$ (container) python manage.py migrate
$ (container) python manage.py collect_data -t GITHUB_PERSONAL_TOKEN -u GITHUB_USERNAME
# or import a SQL dump instead of collecting the data yourself
$ (container) wget https://s3-ap-northeast-1.amazonaws.com/files.albedo.one/albedo.sql
$ (container) mysql -h mysql -u root -p123 albedo < albedo.sql

# username: albedo
# password: hyperion
$ make run
$ open http://127.0.0.1:8000/admin/
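
The comments above say the collection time depends on how many repos you have starred and how many users you follow. As a rough illustration, the sketch below builds (but does not send) GitHub REST API requests for those two lists. It is an assumption about the kind of calls involved, not a description of what `collect_data` actually does; `build_request`, `starred_request`, and `following_request` are hypothetical helper names.

```python
import urllib.request

API_ROOT = "https://api.github.com"


def build_request(path, token, page=1, per_page=100):
    """Build an authenticated, paginated GET request (nothing is sent here)."""
    url = f"{API_ROOT}{path}?page={page}&per_page={per_page}"
    headers = {
        # Personal access tokens are passed in the Authorization header.
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json",
    }
    return urllib.request.Request(url, headers=headers)


def starred_request(username, token, page=1):
    """Request for one page of the repos a user has starred."""
    return build_request(f"/users/{username}/starred", token, page)


def following_request(username, token, page=1):
    """Request for one page of the users a user follows."""
    return build_request(f"/users/{username}/following", token, page)
```

Passing one of these objects to `urllib.request.urlopen` would perform the call. Each page holds at most 100 items, which is one reason a large account can take a long time to crawl.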

Start a Spark Cluster

You can also create the Spark cluster on Google Cloud Dataproc instead of running it locally.

# start a local Spark cluster in Standalone mode
$ make spark_start

Use Popularity as the Recommendation Baseline

See PopularityRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.PopularityRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002017744675282716
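
The baseline idea and the metric fit in a few lines of plain Python. This is a minimal sketch, not the code in PopularityRecommenderBuilder.scala. It assumes star events are (user, repo) pairs, recommends the most-starred repos a user has not starred yet, and scores a ranked list with the common binary-relevance form of NDCG@k. The function names and data layout are hypothetical, and the project's own evaluator may differ in detail.

```python
import math
from collections import Counter


def recommend_popular(stars, user, k):
    """Return the k most-starred repos that `user` has not starred yet."""
    counts = Counter(repo for _, repo in stars)
    seen = {repo for u, repo in stars if u == user}
    # Sort by star count (descending), breaking ties by repo name.
    ranked = sorted(counts, key=lambda repo: (-counts[repo], repo))
    return [repo for repo in ranked if repo not in seen][:k]


def ndcg_at_k(predicted, relevant, k):
    """Binary-relevance NDCG@k: DCG of the list divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(predicted[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Because the score is a ratio of two sums with the same logarithm, the choice of log base does not change the result.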

Build the User Profile for Feature Engineering

See UserProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.UserProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Build the Item Profile for Feature Engineering

See RepoProfileBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.RepoProfileBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train an ALS Model for Candidate Generation

See ALSRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ALSRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.05209047292612741
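
Once ALS has learned a latent vector per user and per repo, candidate generation reduces to ranking repos by the dot product with the user's vector. The sketch below shows only that scoring step, under the assumption that the factors are already available as plain lists. It does not train anything and is not taken from ALSRecommenderBuilder.scala; `top_k_candidates` is a hypothetical helper.

```python
def top_k_candidates(user_vec, item_factors, k, exclude=()):
    """Rank repos by the dot product of user and repo latent vectors."""
    excluded = set(exclude)
    scores = {
        repo: sum(u * v for u, v in zip(user_vec, vec))
        for repo, vec in item_factors.items()
        if repo not in excluded  # e.g. repos the user has already starred
    }
    # Highest score first, ties broken by repo name for determinism.
    return sorted(scores, key=lambda repo: (-scores[repo], repo))[:k]
```

For example, with repo factors `{"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [0.5, 0.5]}` and user vector `[0.9, 0.1]`, excluding `r1` leaves `r3` ranked ahead of `r2`.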

Build a Content-based Recommender for Candidate Generation

Elasticsearch's More Like This API will do the trick.

$ (container) python manage.py sync_data_to_es

See ContentRecommenderBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,org.apache.httpcomponents:httpclient:4.5.2,org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.2,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.ContentRecommenderBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.002559563451967487
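
To make the More Like This approach concrete, the sketch below builds the JSON body of a `more_like_this` query that asks for repos similar to ones a user has already starred. The index name, document type, and field names are assumptions made for illustration; the real mapping lives in the project's code. The `_type` entry matches the 5.x client used above and is not needed on newer Elasticsearch versions.

```python
import json


def build_mlt_query(index, doc_type, repo_ids, fields, size=30):
    """Build a More Like This query body from already-indexed repo documents."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                # Text fields whose terms are compared.
                "fields": list(fields),
                # Reference documents: repos the user has starred.
                "like": [
                    {"_index": index, "_type": doc_type, "_id": str(repo_id)}
                    for repo_id in repo_ids
                ],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


# Hypothetical index, type, IDs, and field names.
body = build_mlt_query("repo", "repo_info_doc", [1, 2], ["description", "full_name"])
payload = json.dumps(body)
```

The resulting `payload` is what a search request against the index would carry. The repos returned become content-based candidates for that user.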

Train a Word2Vec Model for Text Vectorization

See Word2VecCorpusBuilder.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.Word2VecCorpusBuilder \
    target/albedo-1.0.0-SNAPSHOT.jar

Train a Logistic Regression Model for Ranking

See LogisticRegressionRanker.scala for complete code.

$ spark-submit \
    --master spark://localhost:7077 \
    --packages "com.github.fommil.netlib:all:1.1.2,com.hankcs:hanlp:portable-1.3.4,mysql:mysql-connector-java:5.1.41" \
    --class ws.vinta.albedo.LogisticRegressionRanker \
    target/albedo-1.0.0-SNAPSHOT.jar
# NDCG@30 = 0.021114356461615493
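
The ranking stage can be pictured as scoring each candidate repo with a logistic model and sorting by predicted probability. This is a minimal sketch assuming the weights are already trained and each candidate is a numeric feature list. The feature layout and function names are hypothetical and are not taken from LogisticRegressionRanker.scala.

```python
import math


def predict_proba(weights, bias, features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def rank_candidates(weights, bias, candidates, k):
    """Return the k candidate repos with the highest predicted probability."""
    scored = [
        (predict_proba(weights, bias, features), repo)
        for repo, features in candidates.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [repo for _, repo in scored[:k]]
```

The candidates would come from the earlier generation stages (ALS, content-based, popularity), so the ranker only has to order a short list rather than every repo.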

TODO

  • Build a recommender system with Spark: Factorization Machine
  • Build a recommender system with Spark: GBDT for Feature Learning
  • Build a recommender system with Spark: Item2Vec
  • Build a recommender system with Spark: PageRank and GraphX
  • Build a recommender system with Spark: XGBoost

Related Posts