This project was generated by the StormCrawler Maven Archetype as a starting point for building your own crawler.
Have a look at the code and resources and modify them to your heart's content.

# Prerequisites

You need to install Apache Storm. The instructions on [setting up a Storm cluster](https://storm.apache.org/releases/2.6.2/Setting-up-a-Storm-cluster.html) should help. Alternatively, 
the [stormcrawler-docker](https://github.com/DigitalPebble/stormcrawler-docker) project contains resources for running Apache Storm on Docker. 

You also need to have an instance of URLFrontier running. See [the URLFrontier README](https://github.com/crawler-commons/url-frontier/tree/master/service); the easiest way is to use Docker, like so:

``` sh
docker pull crawlercommons/url-frontier
docker run --rm --name frontier -p 7071:7071 crawlercommons/url-frontier
```
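
Before moving on, it can help to confirm that something is listening on the port published above. The snippet below is only a sketch: `frontier_is_up` is a hypothetical helper, it relies on bash's `/dev/tcp` redirection (so it will not work in a plain POSIX `sh`), and it only checks that a TCP connection can be opened, not that the URLFrontier service itself is healthy.

``` bash
# Hypothetical helper: returns 0 if a TCP connection to host ($1) and port ($2) succeeds.
# Relies on bash's /dev/tcp pseudo-device; the subshell closes the socket on exit.
frontier_is_up() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# 7071 is the port published by the docker run command above.
if frontier_is_up localhost 7071; then
  echo "URLFrontier is reachable on port 7071"
else
  echo "Nothing is listening on port 7071 yet"
fi
```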

# Compilation

Generate an uberjar with:

``` sh
mvn clean package
```

# URL injection

The next step is to inject URLs into URLFrontier, using the [client](https://github.com/crawler-commons/url-frontier/tree/master/client). The client is already a dependency of this project, so all
you need to do is run:

``` sh
java -cp target/${artifactId}-${version}.jar crawlercommons.urlfrontier.client.Client PutURLs -f seeds.txt
```

where _seeds.txt_ is a file containing URLs to inject, with one URL per line.
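
If you do not have a seed file yet, one way to create it is sketched below. The two URLs are placeholders chosen for illustration only; replace them with the sites you actually want to crawl.

``` sh
# Write one URL per line to seeds.txt (placeholder URLs, not part of the project).
printf '%s\n' \
  'https://example.com/' \
  'https://example.org/' > seeds.txt
```

Running this before the `PutURLs` command above leaves a two-line _seeds.txt_ in the current directory.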

# Running the crawl

You can now submit the topology using the storm command:

``` sh
storm local target/${artifactId}-${version}.jar --local-ttl 60 ${package}.CrawlTopology -- -conf crawler-conf.yaml
```

This will run the topology in local mode for 60 seconds. To start the topology in distributed mode, where it will run indefinitely, use `storm jar` instead of `storm local` and drop the `--local-ttl` option.

You can also use Flux to do the same:

``` sh
storm local target/${artifactId}-${version}.jar org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
```

Note that in local mode, Flux uses a default topology TTL of 20 seconds. The command above sets `--local-ttl 3600`, so the topology runs for 1 hour.

It is best to run the topology with `storm jar` to benefit from the Storm UI and logging. In that case, the topology runs continuously, as intended.
