Creating a Raider.io or WoW Progress clone

So I play a lot of WoW (World of Warcraft), and I'm kind of competitive (top 10 DPS mage on my server), so I use Raider.io and WoW Progress quite often. And every so often I'm like, I wish they had feature X or Y, I guess I'll wait for someone to implement it. But then the voice in my head says, "Wait a second, carchi8py, you're a software engineer, you're exactly the type of person who could implement something like this."

So I've decided to work on a project to make a clone of Raider.io and WoW Progress, and make it an open source project that anyone can build off of. Since I like Python, I'm going to write most of the code in Python.

First step: how do you get data from Blizzard? Turns out Blizzard makes that very easy. All you need to do is go to dev.battle.net and make a Mashery account. Looking at my Mashery account, the only limit I have is 100 calls a second… (damn, that's way higher than I thought they would give). When you create an account Blizzard will give you a Key and a Secret; these are things you should never share with others (think of them as your password). But with this, you have everything you need to get WoW data.
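
One housekeeping note: the code later in this post imports an apikey module. That's just a tiny file I keep out of source control; a minimal sketch with placeholder values might look like this.

# apikey.py -- keep this file out of source control (add it to .gitignore)
# Placeholder values: substitute the Key and Secret from your dev.battle.net account.
key = "YOUR_API_KEY"          # used by the WoW Community APIs
secret = "YOUR_SECRET"        # used to request an OAuth access token
access = "YOUR_ACCESS_TOKEN"  # used by the WoW Game Data APIs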

So my first task: can I get my character data with this information? For WoW, Blizzard appears to have two different APIs you can call. You can find all the documentation and try out the methods at dev.battle.net/io-docs:

  • WoW Community APIs (one for each region): these contain your character information, auctions, achievements, guilds, items, mounts, pets, PvP, quests, recipes, spells, and so on
  • WoW Game Data APIs: these contain the Mythic Keystone information

So, first order of business: can I pull my character information out of this? I created a little class that gets a character's information. When creating the class you give it the character's realm, their name, and the locale (us, eu, and so on). Calling the get_player function will have Blizzard dump pretty much everything about your character. I end up getting over 260,000 characters' worth of data for my mage…

import apikey
import requests

class Character:
    def __init__(self, realm, name, locale):
        self.realm = realm
        self.name = name
        self.locale = locale

    def get_player(self):
        """
        Get all the information about a Character.

        :return: JSON text with everything Blizzard tracks for the character
        """
        # A single "?" starts the query string; every later parameter
        # is joined with plain "&".
        url = "https://" + self.locale + \
              ".api.battle.net/wow/character/" + self.realm + "/" + self.name + \
              "?fields=items,quests,achievements,audit,progression,feed,professions,talents" + \
              "&locale=en_US&apikey=" + apikey.key
        return requests.get(url).text

from character import Character

char = Character("wyrmrest-accord", "magebears", "us")
print(char.get_player())

W00t, I got character information. Can I get Mythic Keystone leaderboards?

Well, I had some issues with that. To get the leaderboards for a server you need to know its Connected Realm ID. So I made a call to GetRealmIndex, which I thought would give me my Connected Realm ID. Turns out that gives you the Realm ID, which is not the same thing. I ended up having to call GetConnectedRealmIndex (which only gives you links) and go through each link until I found my server (the sketch below automates that search).

{
    "key": {
        "href": "https://us.api.battle.net/data/wow/realm/1369?namespace=dynamic-us"
    },
    "name": "Wyrmrest Accord",
    "id": 1369,
    "slug": "wyrmrest-accord"
}
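
Here's a rough sketch of what that search looks like in code. I'm assuming the connected realm index lists its links under a "connected_realms" key and that each connected realm document has an "id" and a "realms" list with a "slug" field, which matches the shape of the responses I saw; double-check the key names against io-docs.

import apikey
import requests

def find_connected_realm_id(slug, locale="us"):
    """Walk the connected realm index until we find the realm with this slug."""
    index_url = "https://" + locale + ".api.battle.net/data/wow/connected-realm/" + \
                "?namespace=dynamic-" + locale + "&access_token=" + apikey.access
    index = requests.get(index_url).json()
    # The index is just a list of links, one per connected realm group.
    for link in index["connected_realms"]:
        group = requests.get(link["href"] + "&access_token=" + apikey.access).json()
        if any(realm["slug"] == slug for realm in group["realms"]):
            return group["id"]
    return None

print(find_connected_realm_id("wyrmrest-accord"))  # prints 1171 for Wyrmrest Accord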

Once I have my connected realm ID (which for Wyrmrest Accord is 1171, not the 1369 Realm ID above), I could make a small function like I did with characters to get all the leaderboards.


import apikey
import requests

class Server:
    def __init__(self, connected_realm_id, locale):
        self.connected_realm_id = connected_realm_id
        self.locale = locale

    def get_current_leaderboards(self):
        """
        Get the list of all current leaderboards for each dungeon.

        :return: JSON text with a link to each leaderboard for the server
        """
        # The Game Data APIs take an OAuth access token instead of the apikey,
        # and the namespace (hardcoded to dynamic-us here) selects the region.
        url = "https://" + self.locale + \
              ".api.battle.net/data/wow/connected-realm/" + self.connected_realm_id + \
              "/mythic-leaderboard/?namespace=dynamic-us&locale=en_US&access_token=" + apikey.access
        return requests.get(url).text
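
Using it looks just like the character example, with the connected realm ID we dug up above:

from server import Server

server = Server("1171", "us")  # Wyrmrest Accord's connected realm ID
print(server.get_current_leaderboards())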

With just these two calls we pretty much have a way to get all the data Raider.io and WoW Progress have. The next step is to save this information and compute a score like Raider.io and WoW Progress have.

Making the website more than a table

OK, at this point we have done the following:

  • Scraped a cruise site for Cruise information
  • Scraped a flight website for flight information
  • Put all this information in a database
  • And now we are showing the information in a table

A plain HTML table is OK, but it would be really cool if we could sort by any of the columns, or even search them. We could write a bunch of Javascript, or we could use something someone else has already created. In this case I found datatables.net; it lets us sort by any column in our table, and even search it. And the great thing is we don't have to modify our code that much.

The entire cruise.html file is below. Most of it stays the same, but in the head we add both the jQuery and DataTables paths. We also add a little 4-line script telling DataTables what the id of our table is, and then add that id to the table itself. And that's it.

<!DOCTYPE html>
<html>
<head>
    <title>Cruises</title>
    <link rel="stylesheet" type="text/css" href="//cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
    <script type="text/javascript" charset="utf8" src="//cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
    <script type="text/javascript">
        $(document).ready( function () {
        $('#table_id').DataTable();
        } );
    </script>
</head>
<body>

<table border="1" id="table_id" class="display">
    <thead>
        <tr>
            <th>ID</th>
            <th>Date</th>
            <th>Cruise Line</th>
            <th>Ship</th>
            <th>Type</th>
            <th>Cruise Price</th>
            <th>Flight Price</th>
            <th>Total</th>
            <th>Nights</th>
            <th>Price Per Night</th>
        </tr>
    </thead>
     <tbody>
    {% for cruise in cruises %}
        <tr>
            <td><a href="/itinerary/{{cruise.id}}">{{cruise.id}}</a></td>
            <td>{{cruise.date}}</td>
            <td>{{cruise.line.name}}</td>
            <td><a href="/ships/{{cruise.ship_id}}">{{cruise.ship.name}}</a></td>
            <td>{{cruise.departs.name}}</td>
            <td>{{cruise.price}}</td>
            <td>{{cruise.total - cruise.price}}</td>
            <td>{{cruise.total}}</td>
            <td>{{cruise.nights}}</td>
            <td>{{cruise.total/cruise.nights}}</td>
        </tr>
    {% endfor %}
     </tbody>
</table>
</body>
</html>

In the end, this is what it looks like. I can search for 2017 and get a list of all cruises from 2017, then order them by price per day, and I can see the cheapest cruise is a 15-day, $1,199 cruise out of San Francisco.

[Screenshot: the sortable, searchable cruise table]

Adding Ships and Flights to my cheap cruise finder

Most of the project so far has been the cruise part. At this point we have cruises, their prices, where they stop, what ship they are on, and what cruise line runs them. Now to add two new things:

  1. Ship information: how many rooms, how many bars, when it was last refurbished. This is good information to know if you're booking a cruise.
  2. Flight information: there are two large costs on a cruise, the cruise itself and the flight there. Maybe the cruise out of Vancouver is 200 bucks cheaper than the one from Seattle, but if the round-trip flights are 500 bucks more, you'd save 300 bucks with the Seattle one.

Ship Information

This one is pretty easy: Affordabletours has a page for every ship, with a URL in a pretty standard format, /<cruise line>_Cruises/<ship name>. So I made a new file, ship.py, that gets this URL and parses the information I care about, in this case:

  • Year the ship was built
  • Year the ship was Refurbished
  • The number of Crew
  • The Passenger capacity
  • How many bars
  • How many Pools
  • How many Casinos

And that's pretty much it. This script runs after all the cruises have been put in the database, since we only have to run it once for each ship (and not once for each cruise). A rough sketch of the scraper is below.
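
I won't paste all of ship.py here, but the core is the same requests-plus-BeautifulSoup pattern from the original scraping post. A rough sketch, with the caveat that the two-column stat-table markup below is illustrative rather than the page's exact elements:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://cruises.affordabletours.com"

def get_ship_info(cruise_line, ship_name):
    """Scrape one ship's Affordabletours page and return its stats as a dict."""
    url = BASE_URL + "/" + cruise_line + "_Cruises/" + ship_name
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    stats = {}
    # Illustrative parsing: assume the stats (built, refurbished, crew,
    # passengers, bars, pools, casinos) live in two-column table rows.
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            stats[cells[0].text.strip()] = cells[1].text.strip()
    return stats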

Flight Information

Flight information is a little harder; Affordabletours doesn't have flight information. The first place I wanted to grab information from was Google Flights. I decided against it because it would be harder to get the information out, and I found an API that gets its information from Google, so I used that instead.

It's still possible to get information from Google Flights, though. Google doesn't have its data in the HTML; it's injected with Javascript, so you'd have to use Selenium to render the page and pass the result on to Beautiful Soup. On top of that, Google generates random class names on every reload, so you'd have to figure out a way around that.

In this case I use Skypicker's API, which has documentation, so it's very easy to get what you want. For our purposes we only need two things:

  • The Skypicker ID for an airport (we use the locations API): https://locations10.docs.apiary.io/#
  • The Skypicker flights API that will get us a flight between two locations: https://skypickerpublicapi.docs.apiary.io/#reference/flights/get

I won't go over the entire flights.py file, but you can see it here: https://github.com/carchi8py/cheap-cruise-finder/blob/0.2/src/flights.py

The main function does the following:

  1. Grab every cruise in our database.
  2. For every cruise, find the start and end dates, and grab the ports we are in on those days. Pass each port name to the get_airport_code function to get Skypicker's ID for that location.
  3. Once we have this information, get the flights.

TEST_AIRPORT = "SFO"

def main():
    cruises = db.session.query(Cruise).order_by(Cruise.nights).all()
    for cruise in cruises:
        start_date = cruise.date
        start_airport = get_airport_code(cruise.departs.name)
        # The last day of the cruise tells us where (and when) we fly home from.
        days = cruise.days
        end_airport = get_airport_code(days[-1].port.name)
        end_date = days[-1].date
        # No flight is needed if the cruise starts or ends at our home airport.
        if TEST_AIRPORT != start_airport:
            to_flight = generate_flight_url(start_date, TEST_AIRPORT, start_airport)
            to_flight_data = get_flight_details(to_flight)
            add_to_cruise(to_flight_data, cruise)
        if TEST_AIRPORT != end_airport:
            from_flight = generate_flight_url(end_date, end_airport, TEST_AIRPORT)
            from_flight_data = get_flight_details(from_flight)
            add_to_cruise(from_flight_data, cruise)

And with this we have flight information for every cruise.
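
For reference, a get_airport_code along these lines would do the job (a sketch, not the exact code in flights.py; the term and location_types parameter names are my reading of the Apiary docs, so verify them there):

import requests

LOCATIONS_URL = "https://api.skypicker.com/locations"

def get_airport_code(port_name):
    """Ask Skypicker's locations API for the airport closest to a port name."""
    params = {"term": port_name, "location_types": "airport", "limit": 1}
    data = requests.get(LOCATIONS_URL, params=params).json()
    locations = data.get("locations", [])
    # The id of an airport location is its IATA code, e.g. "SFO".
    return locations[0]["id"] if locations else None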

Adding documentation with Sphinx

Up until this point, I haven't been adding any documentation to my code. But now that we have a very small working example, with static analysis and coverage, it's time to add documentation as well. In Python I use Sphinx for documentation. This is the same package that makes Python's own documentation, and a lot of other Python projects' documentation.

I'm not going to go into how to set up Sphinx on your own project, but you can watch this video, which shows all the steps you need: https://www.youtube.com/watch?v=qrcj7sVuvUA
The short version of what I do to generate documentation: after running the quickstart, I run the following in my documentation directory.

sphinx-apidoc -o ./rst ../src/
cp conf.py rst/
cp modules.rst index.rst
sphinx-build -b html rst/ html/

If we look through our documentation we can see all our functions there, but since I've written no documentation strings yet, they are pretty much empty.

[Screenshot: generated Sphinx docs with empty function descriptions]

So let's write some simple documentation:

def add_cruiseline(cruise_line):
    """
    Commit a cruise line to the cruiseline table in the database

    :param cruise_line: The name of the Cruise line
    :return: nothing
    """
    commit(CruiseLine(name = cruise_line))

I use PyCharm as my Python IDE, and what's nice is that if I start a documentation string (three quotes in a row), it generates the basic structure that Sphinx looks for when creating documentation (the :param and :return lines). All I have to do is fill in the rest myself: a line describing the function, a line for each parameter, and what the function returns. You can also do much more complicated things like highlighting and showing code examples (check the Sphinx docs for that). Since my code is very simple right now there is no need for me to do that.
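
For example, if I did want a code sample in a docstring, Sphinx docstrings are reStructuredText, so the code-block directive works; something like this (using the add_port function from the database code) would render with syntax highlighting:

def add_port(port):
    """
    Commit a port to the port table in the database.

    :param port: The name of the port
    :return: nothing

    .. code-block:: python

        add_port("San Francisco")
    """
    commit(Port(name=port))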

Now that I've added documentation, if I run sphinx-build again I'll see my documentation (I switched themes as well). You can see all the source code on GitHub on branch 0.1.4.

[Screenshot: Sphinx docs with the descriptions filled in, using the new theme]

Creating a Unit Test in Python

So it's good practice to keep your unit test coverage above a certain percentage. SonarCloud's quality profile defaults to 80%, which I think is a good benchmark. So I want to add coverage to our Cruise file. Well, that's pretty easy to do: we have a function add_to_db that takes in the data we generated in the function get_cruise_data, so let's fake some data, and then in our test pass this data to the add_to_db function. This is going to do a few things:

  1. Creates a fake Cruise Line called Carchi8py Super Cruise
  2. Creates a fake ship called The Carchi8py
  3. Creates a port called Earth
  4. And a Cruise with all this information

# Fake cruise variables
FAKE_DATE = "Oct 17, 1990"
FAKE_LINE = "Carchi8py Super curise"
FAKE_SHIP = "The Carchi8py"
FAKE_DESTINATION = "Mars"
FAKE_DEPARTS = "Earth"
FAKE_NIGHTS = "2000"
FAKE_PRICE = "$999,999,999,999"

class TestAddCruiseToDatabase(unittest.TestCase):
    def test_adding_to_database(self):
        fake_cruise = [FAKE_DATE, FAKE_LINE, FAKE_SHIP, FAKE_DESTINATION, FAKE_DEPARTS, FAKE_NIGHTS, FAKE_PRICE]
        cruise.add_to_db(fake_cruise)
        test_cruise = session.query(Cruise).filter_by(nights=FAKE_NIGHTS).one()
        self.assertEqual(test_cruise.line.name, FAKE_LINE)
        self.assertEqual(test_cruise.ship.name, FAKE_SHIP)
        self.assertEqual(test_cruise.destination, FAKE_DESTINATION)
        self.assertEqual(test_cruise.departs.name, FAKE_DEPARTS)
        self.assertEqual(test_cruise.nights, int(FAKE_NIGHTS))

We do a quick check to see if this data is there and we're done… or are we? The problem with a test like this is that we've added data to our database. We don't want that, so we need to create a function to remove this data, and test that it works. In our cruise file, we add a function that reverts the add we did, removing things in the opposite order that we added them. (Other than the Cruise row, which has foreign keys to all 3 other tables and so must be deleted first, it doesn't matter which order you delete in.)

def remove_from_db(cruise_data):
    db_delete(session.query(Cruise).filter_by(nights = cruise_data[5]).first())
    print("Removing Cruise")
    db_delete(session.query(Port).filter_by(name = cruise_data[4]).first())
    print("Removing Port")
    db_delete(session.query(Ship).filter_by(name=cruise_data[2]).first())
    print("Removing Ship")
    db_delete(session.query(CruiseLine).filter_by(name=cruise_data[1]).first())
    print("Removing CruiseLine")

We go back to the function in our test class that added the fake cruise and add some lines to delete it. Unlike the first test, where we only needed to make one database call, we will need to query each table to make sure all the rows have been deleted.

    def test_adding_to_database(self):
        fake_cruise = [FAKE_DATE, FAKE_LINE, FAKE_SHIP, FAKE_DESTINATION, FAKE_DEPARTS, FAKE_NIGHTS, FAKE_PRICE]
        cruise.add_to_db(fake_cruise)
        test_cruise = session.query(Cruise).filter_by(nights=FAKE_NIGHTS).one()
        self.assertEqual(test_cruise.line.name, FAKE_LINE)
        self.assertEqual(test_cruise.ship.name, FAKE_SHIP)
        self.assertEqual(test_cruise.destination, FAKE_DESTINATION)
        self.assertEqual(test_cruise.departs.name, FAKE_DEPARTS)
        self.assertEqual(test_cruise.nights, int(FAKE_NIGHTS))
        cruise.remove_from_db(fake_cruise)
        test_cruise = session.query(Cruise).filter_by(nights=FAKE_NIGHTS).first()
        self.assertIsNone(test_cruise)
        test_ship = session.query(Ship).filter_by(name=FAKE_SHIP).first()
        self.assertIsNone(test_ship)
        test_line = session.query(CruiseLine).filter_by(name=FAKE_LINE).first()
        self.assertIsNone(test_line)
        test_port = session.query(Port).filter_by(name=FAKE_DEPARTS).first()
        self.assertIsNone(test_port)

Running the make_coverage.sh script from the last post, we can see we've increased our coverage to 82%. In the output we can see the fake cruise being added to the database and then removed, and the test passes without any issues.

carchi (0.1.3) cheap-cruise-finder $ ./make_coverage.sh 
Adding CruiseLine Carchi8py Super curise to database
Adding Ship The Carchi8py to database
Adding Port Earth to database
Adding cruise ['Oct 17, 1990', 'Carchi8py Super curise', 'The Carchi8py', 'Mars', 'Earth', '2000', '$999,999,999,999'] to database
Removing Cruise
Removing Port
Removing Ship
Removing CruiseLine
.
----------------------------------------------------------------------
Ran 1 test in 0.079s

OK
Name                        Stmts   Miss  Cover
-----------------------------------------------
src/cruise.py                  84     22    74%
src/database_setup.py          32      0   100%
src/tests/cruise_test.py       42      0   100%
src/tests/website_test.py       6      0   100%
src/website.py                 27     12    56%
-----------------------------------------------
TOTAL                         191     34    82%

Now doing a Sonar run, we can see our new coverage.

[Screenshot: SonarCloud dashboard showing the new coverage metric]

Clicking on the coverage metric, we get a list of files, and then by clicking on a file we can see coverage for each line. Here we can see the lines that are missing coverage in our main function.

[Screenshot: per-line coverage for the main function]

Static analysis and Coverage for Python

Before we move forward with the project, now that we have a small site up and running, we need to talk about a few things I've purposely skipped. The first is static analysis and coverage. For both of these I use a tool called SonarQube; it's a great tool for finding problem areas in your code (in pretty much any language) and seeing how they change over time. SonarCloud.io is a public version you can use if you're working on an open source project.

If we go to my project page so far we can see quite a bit. First, I have some issues I should fix: 1 Vulnerability (something people could use to take over the site), 6 Bugs, and 2 Code Smells (style-type issues). If I had coverage, it would appear as a line here as well; we'll fix that later.

[Screenshot: SonarCloud project overview with the issue counts]

So let's fix some of these bugs. Clicking on any of the numbers takes me to the issues page, where I can see the issues I have. In this case, my HTML is missing a DOCTYPE and I don't have title tags, so let me go ahead and fix those.

[Screenshot: SonarCloud issues page listing the HTML bugs]

To run SonarCloud, they have a great page on the site explaining how to run your code through it depending on the language you're using, as some languages are easier than others.

Now that we've fixed our issues, let's add some coverage. For Python I use coverage.py, which is easy to use and easy to get SonarCloud to understand.

I created a small shell script at the root directory of the project called make_coverage.sh. It runs our Python unit test files (which I'll cover later), omits coverage from /usr/ so we don't count files we don't own, combines the reports, and creates an XML file that Sonar can read.

rm -rf .coverage
coverage run --parallel-mode --omit "/usr/*" src/tests/cruise_test.py
coverage run --parallel-mode --omit "/usr/*" src/tests/website_test.py
coverage combine
coverage report
coverage xml

Next, I make the unit test files, which right now just import the file I want to test. I'm not going to write any tests yet, so none of the functions in these files get called, and they show up as not having coverage. Because both website and cruise import database_setup, it's going to appear as having 100% coverage even though we didn't write any tests for it.

import os
import sys

# Make sure the src directory is on sys.path so the test can import the code under test.
BASE_DIR = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
if BASE_DIR not in sys.path:
    sys.path.insert(0, BASE_DIR)

import cruise

I then run ./make_coverage.sh to generate the coverage report.

carchi (0.1.3) cheap-cruise-finder $ ./make_coverage.sh 
Name                        Stmts   Miss  Cover
-----------------------------------------------
src/cruise.py                  71     46    35%
src/database_setup.py          32      0   100%
src/tests/cruise_test.py        6      0   100%
src/tests/website_test.py       6      0   100%
src/website.py                 30     12    60%
-----------------------------------------------
TOTAL                         145     58    60%

After the tests have run I can see my project is currently at about 60% coverage. But how can that be, when we haven't written any tests? Coverage is showing us the lines that have been executed. In this case, when we import src/cruise.py we execute all the module-level code in src/database_setup.py; it's not telling us whether a test ran against it or not. So we need to keep that in mind: even 100% coverage doesn't mean every line has a test that ran against it, just that every line was hit while we ran coverage.

Now if we add this line

sonar.python.coverage.reportPath=coverage.xml

to our sonar properties file and rerun SonarCloud, we will now see our coverage. Clicking on it, we can see the exact lines that are and aren't covered in our code.
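
For context, that line sits alongside the rest of the Sonar settings. I'm not reproducing my exact file, but a minimal sonar-project.properties for a project like this looks roughly like (project key and name are placeholders):

sonar.projectKey=cheap-cruise-finder
sonar.projectName=Cheap Cruise Finder
sonar.sources=src
sonar.python.coverage.reportPath=coverage.xml
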
[Screenshot: SonarCloud showing per-line coverage]

Flask: Creating a simple website using sqlalchemy

From the previous post, we have scraped a website using Beautiful Soup to get cruise data, and we have stored that data in a simple database. Now we want to show this data on a website that shows us the cheapest cruise. There are two Python packages I know of that do this pretty well: Django and Flask. Django is a complete web framework; it has pretty much everything you'd want for putting together a website with Python on the backend. Flask is a much lighter-weight package (they even call it a microframework). For this project I'm going to start off using Flask; if I feel this is going to become more than just a fun side project, I might switch over to Django.

Making simple pages with Flask is super easy, so let's make our first page. I want to make a page at <whatever our domain is>/lines/, using the @app.route decorator. This tells Flask that the function below it should be called when someone goes to /lines/. For this page, I just want to list all cruise lines alphabetically. session.query(CruiseLine) tells sqlalchemy to return all columns of the CruiseLine table (if we did CruiseLine.name we would just get the names back), and .order_by(CruiseLine.name) tells sqlalchemy I want to order the results by name.

@app.route('/lines/')
def cruiseLines():
    lines = session.query(CruiseLine).order_by(CruiseLine.name)
    return render_template('lines.html', lines=lines)

The second line tells Flask to pass lines to the lines.html file in our templates directory (we'll need a directory called templates in our src directory for this to work). Our HTML file is just HTML plus a few simple template commands that Flask runs when the page is rendered. In this case, we have a for loop going through each element of lines and printing the name, followed by a line break.

<html>
<body>
{% for line in lines %}
    {{line.name}}<br>
{% endfor %}
</body>
</html>

When we go to the page, we get an alphabetical list of the cruise lines in our database.

[Screenshot: alphabetical list of cruise lines]

So that's cool, but let's do something a little more difficult. Let's list all cruises by the cheapest price per day in an HTML table, showing all the information we know about each cruise. This sounds like it's going to be much more difficult, since now we are using all 4 tables in the database and showing 7 fields' worth of data, but it turns out it's not much more code than what we had above.

The Python code is pretty much the same, except we are ordering by the price divided by the number of nights: if the cost is 500 bucks and it's a 5-night cruise, we end up with 100 bucks per night.

@app.route('/cruises/')
def cruiseByPricePerDay():
    cruises = session.query(Cruise).order_by(Cruise.price/Cruise.nights)
    return render_template('cruise.html', cruises=cruises)

The HTML is also pretty simple. Since we made relationships in the Cruise table to the other tables, we can easily get the data we need: cruise.line points to a specific row in the CruiseLine table, and cruise.line.name grabs the name from the CruiseLine row that the line foreign key in our Cruise table points to.

<html>
<body>
<table border="1">
    <tr>
        <th>Date</th>
        <th>Cruise Line</th>
        <th>Ship</th>
        <th>Departs</th>
        <th>Price</th>
        <th>Nights</th>
        <th>Price Per Night</th>
    </tr>
    {% for cruise in cruises %}
        <tr>
            <td>{{cruise.date}}</td>
            <td>{{cruise.line.name}}</td>
            <td>{{cruise.ship.name}}</td>
            <td>{{cruise.departs.name}}</td>
            <td>{{cruise.price}}</td>
            <td>{{cruise.nights}}</td>
            <td>{{cruise.price/cruise.nights}}</td>
        </tr>
    {% endfor %}
</table>
</body>
</html>

Looking at the website now, we can see our cruises to Hawaii ordered by the cheapest price, and the cheapest one is just under 70 bucks a night (not bad). At this point we have all the basic building blocks we need; next step, going more in depth to make this an actual product.

[Screenshot: cruise table ordered by price per night]

You can see the code that was used to make the website on my GitHub page in branch 0.1.2.

sqlalchemy: Adding information to the database

So we have our data scraped, and we have a database. Now we need a method to add the data to the database before we can use it in a website.

Since our Cruise table depends on the CruiseLine, Ship, and Port tables, those need to be filled in first. And since we are using 'id' as our key, we want to make sure we are not re-adding the same data multiple times. I could have avoided this by making the name the key, assuming it was unique, but since I don't know what will be in the data I decided against that. So I've written a line like this:

session.query(exists().where(CruiseLine.name == cruise_data[1])).scalar()

This line checks whether the CruiseLine table already has a name that matches the one we are trying to add. If there is one, we know the cruise line exists in the database, so we don't need to add it again.

I do the same thing for all the tables until I get to Cruise. I tried Cruise.ship.name but got an error; I'm not 100% sure what I did wrong (if you have any idea let me know). So what I ended up doing was looking at the other data fields (date, nights, and destination). It's not perfect: if more than one cruise goes to the same place for the same number of nights on the same date, it won't get added. But for now this is fine; I'll just remember to come back and fix it later.
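
If I had to guess at the error: Cruise.ship is a relationship, not a column, so you can't reach through it with Cruise.ship.name inside a filter. The usual fix is to join through the foreign key (or use the relationship's has() operator). A sketch of the foreign-key version, untested against this exact schema:

from sqlalchemy import and_, exists

# Join Cruise to Ship through the foreign key inside the EXISTS clause,
# instead of reaching through the relationship attribute.
session.query(
    exists().where(and_(Cruise.ship_id == Ship.id,
                        Ship.name == cruise_data[2]))
).scalar()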

def add_to_db(cruise_data):
    # Check to see if the cruise line already exists
    if not session.query(exists().where(CruiseLine.name == cruise_data[1])).scalar():
        add_cruiseline(cruise_data[1])
        print("Adding CruiseLine %s to database" % cruise_data[1])
    if not session.query(exists().where(Ship.name == cruise_data[2])).scalar():
        add_ship(cruise_data[2])
        print("Adding Ship %s to database" % cruise_data[2])
    if not session.query(exists().where(Port.name == cruise_data[4])).scalar():
        add_port(cruise_data[4])
        print("Adding Port %s to database" % cruise_data[4])
    date_obj = datetime.datetime.strptime(cruise_data[0], "%b %d, %Y").date()
    # Use sqlalchemy's and_() so all three conditions end up in the SQL;
    # Python's plain "and" would keep only the last one.
    if not session.query(exists().where(and_(Cruise.date == date_obj,
                                             Cruise.nights == cruise_data[5],
                                             Cruise.destination == cruise_data[3]))).scalar():
        add_cruise(cruise_data, date_obj)
        print("Adding cruise %s to database" % str(cruise_data))

def add_cruiseline(cruise_line):
    commit(CruiseLine(name = cruise_line))

def add_ship(ship):
    commit(Ship(name = ship))

def add_port(port):
    commit(Port(name = port))

def add_cruise(cruise_data, date_obj):
    line_obj = session.query(CruiseLine).filter_by(name = cruise_data[1]).one()
    ship_obj = session.query(Ship).filter_by(name = cruise_data[2]).one()
    port_obj = session.query(Port).filter_by(name = cruise_data[4]).one()
    new_cruise = Cruise(date = date_obj,
                        line = line_obj,
                        ship = ship_obj,
                        destination = cruise_data[3],
                        departs = port_obj,
                        nights = cruise_data[5],
                        price = cruise_data[6])
    commit(new_cruise)

def commit(query):
    session.add(query)
    session.commit()

Now, this code isn't perfect. Prices won't update when things change, so we will need to remember to go back and add that functionality. But for now we have scraped data from a website and added it to a database, so next we can show this data on a simple website.

To view all the changes I made to add cruises to the database, check out branch 0.1.1 on GitHub.

sqlalchemy: Setting up a Database

From the last post, we were able to scrape data using Beautiful Soup. We could go straight to making a website for this data, but it's probably better to store it in a database so that our web pages load faster. So what data do we have?

  • Date of the cruise
  • The cruise line
  • The cruise ship
  • The Destination
  • The departure port
  • How many nights
  • The price of the cruise

For databases in Python I like using sqlalchemy; it's very powerful but easy to use, and a lot of other people use it, so you can easily find examples online of pretty much anything you plan to do. The first thing we have to do is figure out how to lay out our database tables. We could have a single table called Cruises, but in the future we might want to include ship information, and if we have everything in one table we would have to repeat the same data over and over again.

So looking at the data above, I believe the following should each be their own table:

  • The cruise line: we may want to include a link, a rating, and other things of that nature that apply only to the cruise line.
  • The cruise ship: we might want to know the size of the ship, special features, the number of bars, and the average rating people give the specific ship.
  • The departure port: where is it, what are the nearest airports, what do people think about the port.

So I end up with 3 tables like this. All they currently have are names, but in the future we'll add more information to them. For the last table, instead of calling it departure port, I've used the more generic term port, so that in the future, if we include all the ports a cruise ship stops at, we'll be able to put them in the same table. Each table has a primary key, a number that represents that row in the table.

class CruiseLine(Base):
    __tablename__ = 'cruiseline'
    id = Column(Integer, primary_key=True)
    name = Column(String(128), nullable=False)

class Ship(Base):
    __tablename__ = 'ship'
    id = Column(Integer, primary_key=True)
    name = Column(String(128), nullable=False)

class Port(Base):
    __tablename__ = "port"
    id = Column(Integer, primary_key=True)
    name = Column(String(256), nullable=False)

The last step is making our Cruise table, which uses the rest of the information and has a relationship to the 3 tables above. We make each relationship using sqlalchemy's relationship function, and next to each one we add a column containing the ForeignKey to the other table. This will be useful when we make the full website.

class Cruise(Base):
    __tablename__ = 'cruise'
    id = Column(Integer, primary_key=True)
    date = Column(Date, nullable=False)
    line_id = Column(Integer, ForeignKey('cruiseline.id'))
    line = relationship(CruiseLine)
    ship_id = Column(Integer, ForeignKey('ship.id'))
    ship = relationship(Ship)
    destination = Column(String(256))
    departs_id = Column(Integer, ForeignKey('port.id'))
    departs = relationship(Port)
    nights = Column(Integer)
    price = Column(Integer)

You can view the full code for the database here on GitHub. To create the database, all you need to do is run

python3 database_setup.py

And the database will be created.
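
The creation itself is the standard sqlalchemy boilerplate at the bottom of database_setup.py. I'm not reproducing my exact file, but the shape is roughly this (the sqlite filename is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()  # the table classes above all inherit from this

# ... the CruiseLine, Ship, Port, and Cruise classes from above ...

if __name__ == "__main__":
    # Create (or open) the sqlite file and build any tables that don't exist yet.
    engine = create_engine("sqlite:///cruises.db")
    Base.metadata.create_all(engine)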

Beautiful Soup: Scraping a website to get Cruise information

To start this project we need data for cruises and flights. I want to work on cruises first, and after looking around, there aren't too many travel websites that show cruises. I was able to find Expedia and Affordable Tours listing cruises. Affordable Tours has a much simpler interface, so we will probably have better luck using Beautiful Soup with it.

Beautiful Soup is a great Python package for easily grabbing data out of HTML. There are a few different ways you can use Beautiful Soup; because everything I need on the Affordable Tours website is in the HTML, I'm going to use requests to grab the HTML. This approach will not work if you're trying to grab information off a website that uses Javascript to insert the data. For that, you'll want to use Beautiful Soup with Selenium; I have some example code here for those interested in how that works.
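
The Selenium route looks something like this: let a real browser execute the Javascript, then hand the rendered HTML to Beautiful Soup. A minimal sketch (you'd need chromedriver installed):

from bs4 import BeautifulSoup
from selenium import webdriver

def get_rendered_soup(url):
    """Render a Javascript-heavy page in a real browser, then parse it."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source is the DOM after Javascript has run,
        # unlike the raw HTML you get from requests.get(url).text
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()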

To start with, I want a page that already exists, to make sure I can scrape it before I do something harder. So I've started off with a list of all cruises to Hawaii. The URL looks like this:

https://cruises.affordabletours.com/search/advanced_search/?destination=21&departuredate=&cruiseline1=&numnights=0&dport1=&resident=

The URL is broken up into 2 parts.

  • The actual page: https://cruises.affordabletours.com/search/advanced_search/
  • And the parameters: destination=21&departuredate=&cruiseline1=&numnights=0&dport1=&resident=

So in my code I'm going to set up some variables for these. For now I'm going to hardcode the 21, since that equals Hawaii; in the future I will make it an option that is passed in on a website or something of that nature. The PARAMS variable sets up the URL parameters; since I only care about the destination, I'm throwing the others away.

DESTINATION = "21"
PARAMS = {"destination": DESTINATION}
URL = "https://cruises.affordabletours.com/search/advanced_search"

The next step is calling our URL, getting the data, and passing it to BeautifulSoup. We do that with the following lines of code. The first line gets the URL with the params we passed it; the URL that is generated looks like this: https://cruises.affordabletours.com/search/advanced_search/?destination=21. The second line takes the HTML text that was returned and passes it to BeautifulSoup's HTML parser.

    r = requests.get(URL, params=PARAMS)
    soup = BeautifulSoup(r.text, "html.parser")

Now that we have our soup, we need to know what we are looking for. I use Chrome's inspect tool for this to find the element we want. In this case I find that the returned results are in a table with the class search-results.

[Screenshot: Chrome inspector showing the search-results table]

So I use soup's find function to find a table tag that has the class "search-results" (in this case there is only one). results is an object just like soup, but now only contains the HTML for this specific table. Since we are in a table, and each row of the table is a tr, I want to get every single tr in the table, so I use the find_all function. This gives me every single cruise listed on the page.

    results = soup.find("table", {"class": "search-results"})
    cruises = results.find_all("tr")

Lastly, I want to get the data out of each tr in the table, so I create a for loop. The first row in the table is the header, so we skip it; for each other row we grab the different columns (date, line, ship, destination, departs, nights, price) and print the data out.

    for cruise in cruises[1:]:
        date = cruise.find("td", {"class": "table-date"}).text
        line = cruise.find("td", {"class": "table-line"}).text
        ship = cruise.find("td", {"class": "table-ship"}).text
        destination = cruise.find("td", {"class": "table-destination"}).text
        departs = cruise.find("td", {"class": "table-departs"}).text
        nights = cruise.find("td", {"class": "table-nights"}).text
        price = cruise.find("td", {"class": "table-price"}).text
        print(date, line, ship, destination, departs, nights, price)

Putting it all together, you end up with a very simple script that parses the website and prints the data.

from bs4 import BeautifulSoup
import requests

# To start with I'll hard-code 21 (which is Hawaiian cruises).
DESTINATION = "21"
PARAMS = {"destination": DESTINATION}
URL = "https://cruises.affordabletours.com/search/advanced_search"

def main():
    r = requests.get(URL, params=PARAMS)
    soup = BeautifulSoup(r.text, "html.parser")
    results = soup.find("table", {"class": "search-results"})
    cruises = results.find_all("tr")
    # The first row of the table is the header, which we don't care about, so we skip it
    for cruise in cruises[1:]:
        date = cruise.find("td", {"class": "table-date"}).text
        line = cruise.find("td", {"class": "table-line"}).text
        ship = cruise.find("td", {"class": "table-ship"}).text
        destination = cruise.find("td", {"class": "table-destination"}).text
        departs = cruise.find("td", {"class": "table-departs"}).text
        nights = cruise.find("td", {"class": "table-nights"}).text
        price = cruise.find("td", {"class": "table-price"}).text
        print(date, line, ship, destination, departs, nights, price)

if __name__ == "__main__":
    main()

When you run the script you end up with a list of Hawaiian cruises: the date, cruise line, ship, destination, the port they leave from, the number of nights, and the price.

carchi (0.1) src $ python3 cruise.py 
Oct 05, 2017 Ponant Le Boreal Tahiti Honolulu 15 $7,715
Oct 06, 2017 Royal Caribbean Explorer of the Seas Australia/New Zealand Seattle 22 $1,822
Oct 08, 2017 Princess Emerald Princess Hawaii Los Angeles (San Pedro) 28 $2,599
Oct 14, 2017 Seabourn  Sojourn Hawaii Los Angeles (San Pedro) 32 $9,686