Distributed Search Engine

This is a class project for CIS 555 Internet & Web Systems (Spring 2021) at the University of Pennsylvania.

My team built a distributed search engine, with 4 components:

All components run in a distributed setting, using AWS cloud storage services such as S3 and DynamoDB, together with Apache Spark, Apache Storm, and Amazon EMR. I mainly worked on the indexer, which reads document contents and creates an inverted index used by the search engine and ranking algorithm.
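To illustrate what an inverted index is, here is a minimal single-machine sketch in Python. It is not the project's indexer, which runs in a distributed setting. The function names (`tokenize`, `build_inverted_index`), the sample documents, and the choice to store raw term frequencies are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency}.

    `docs` is a hypothetical in-memory mapping of doc_id -> document text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Count how often each term occurs in this document.
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


# Hypothetical sample documents.
docs = {
    "doc1": "Distributed search engine",
    "doc2": "Search the distributed index",
}
index = build_inverted_index(docs)
print(index["search"])  # {'doc1': 1, 'doc2': 1}
```

A search engine can then answer a query by looking up each query term in the index and combining the matching document lists, rather than scanning every document.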

See our project architecture below, and the report here for implementation and optimization details.

[Figure: Search Engine Architecture]