
Technical Manual

Soccer Data Web Crawler

Team No. 02

First Name Last Name Role

Trupti Sardesai Project Manager

Wenchen Tu Prototyper

Subessware Selvameena Karunamoorthy System/Software Architect

Pranshu Kumar Requirements Engineer

Zhitao Zhou Feasibility Analyst

Yan Zhang Operational Concept Engineer

Qing Hu Life Cycle Planner

Amir ali Tahmasebi Shaper

Page 2: greenbay.usc.edu · Web viewfirst_name text, middle_name text, last_name text, full_name text NOT NULL,gender VARCHAR(6),source text NOT NULL, team_name text,division text,conference
Page 3: greenbay.usc.edu · Web viewfirst_name text, middle_name text, last_name text, full_name text NOT NULL,gender VARCHAR(6),source text NOT NULL, team_name text,division text,conference

Version History

Date       Author              Version  Changes made                     Rationale

12/01/14   SK, WT, ZZ, PK, YZ  1.0      First version of this document   For DDC package

12/08/14   ZZ                  2.0      Second version of this document  Changed some indexes


Table of Contents

Technical Manual
Version History
Table of Contents
Table of Figures
1. Introduction
2. Installation Guideline
2.1 Prerequisite Software
2.2 Software Installation
2.2.1 Scrapy
2.2.2 Facebook-SDK
2.2.3 Psycopg2
2.2.4 Python 2.7
2.2.5 pip
2.2.6 PostgreSQL database
2.2.7 Ruby on Rails
2.2.8 Tweepy
3. Modules
3.1 Crawler
3.1.1 Scrapy
3.1.1.1 Structure
3.1.1.2 items.py
3.1.1.3 spiderName.py
3.1.1.4 Pipelines.py
3.1.2 BeautifulSoup
3.1.2.1 Program Flow
3.1.2.2 Implementation
3.2 Facebook
3.2.1 Structure
3.2.1.1 Classes
3.2.2 Workflow
3.2.2.1 General Program Flow
3.2.2.2 postsParser.parse workflow
3.2.3 Q & A
3.3 Twitter
3.3.1 Structure
3.3.2 Workflow
3.3.2.1 General Program Flow
3.3.2.2 Get Function workflow
3.3.3 Q & A
3.4 Data Ingestion
3.4.1 Connect to Database
3.4.2 Create custom dictionary
3.4.3 Identify duplicates
3.4.4 Insert into Database
3.4.5.1 Tasks Tab
3.5 Scheduler


Table of Figures

Figure 1: Program workflow for Spider

Figure 2: General Program Flow for Facebook

Figure 3: Post workflow

Figure 4: Twitter Snapshot

Figure 5: General Program flow for Twitter

Figure 6: Get function workflow


1. Introduction

This manual explains the implementation details of our product.

2. Installation Guideline

2.1 Prerequisite Software

Before installing the main module, you should have the following installed:

● Python 2.7
● pip and setuptools Python packages
● PostgreSQL database, with valid credentials including username, password, host, port, and database name
● Ruby on Rails 2.0
● Tweepy
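As a quick sanity check, the following sketch (not part of the installers; the import names are assumptions, e.g. facebook-sdk imports as facebook and BeautifulSoup 4 as bs4) prints the Python version and verifies that the packages installed in the next sections can be imported.

import sys

print(sys.version)  # expect a 2.7.x version string

# module names are assumptions: facebook-sdk imports as 'facebook', BeautifulSoup 4 as 'bs4'
for module_name in ('scrapy', 'facebook', 'psycopg2', 'tweepy', 'bs4'):
    try:
        __import__(module_name)
        print(module_name + ': OK')
    except ImportError:
        print(module_name + ': missing')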

2.2 Software Installation

2.2.1 Scrapy

You can use pip to install Scrapy. The command is ‘pip install Scrapy’.

2.2.2 Facebook-SDK

To install facebook-sdk, the only correct way is to download the latest package from the GitHub link https://github.com/pythonforfacebook/facebook-sdk and run the command ‘python setup.py install’.

Warning: do not use ‘pip install facebook-sdk’ to install facebook-sdk; it will break the code.

2.2.3 Psycopg2

Usually Psycopg2 is installed along with the PostgreSQL database. If not, you can install it with pip: ‘pip install psycopg2’.

2.2.4 Python 2.7

To install Python 2.7, first download the latest version of Python 2.7 from the official website https://www.python.org/. Then follow the installation instructions at https://docs.python.org/2/ for the operating system that you are using.

2.2.5 pip

To install pip, go to https://pip.pypa.io/en/latest/installing.html and follow the instructions for the operating system that you are using.

2.2.6 PostgreSQL database


To install the PostgreSQL database, go to http://www.postgresql.org/download/, choose the operating system that you are using, and follow the instructions on that page.

2.2.7 Ruby on Rails

To install Ruby on Rails, go to http://rubyinstaller.org/downloads/, download Ruby 2.0.0 and the Development Kit for your operating system, and follow the instructions.

2.2.8 Tweepy

You can use pip to install Tweepy. The command is ‘pip install Tweepy’. Tweepy includes Stream, OAuthHandler, and StreamListener, so we do not need to install these modules explicitly.

3. Modules

The software contains three modules: crawler, Facebook, and Twitter.

3.1 Crawler

The crawler is implemented in two ways: using Scrapy and using BeautifulSoup.

3.1.1 Scrapy

3.1.1.1 Structure

The file structure of a Scrapy project is as follows.

projectName/
    scrapy.cfg
    projectName/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spiderName.py

The parts you need to modify are items.py and spiderName.py.

3.1.1.2 items.py

In items.py, we define the items. Items are containers that will be loaded with the scraped data; they work like simple Python dictionaries but provide additional protection against populating undeclared fields, to prevent typos.


They are declared by creating a scrapy.Item subclass and defining its attributes as scrapy.Field objects, as you would in an ORM.

Here is a sample item definition in items.py.

class USLPlayerStat(scrapy.Item):
    ln = scrapy.Field()
    fn = scrapy.Field()
    seq = scrapy.Field()
    personkey = scrapy.Field()
    team = scrapy.Field()
    jersey = scrapy.Field()
    midname = scrapy.Field()
    nickname = scrapy.Field()
    birthdate = scrapy.Field()
    regnum = scrapy.Field()
    pos1 = scrapy.Field()
    pos2 = scrapy.Field()
    GP = scrapy.Field()
    Points = scrapy.Field()
    Min = scrapy.Field()
    G = scrapy.Field()
    A = scrapy.Field()
    S = scrapy.Field()
    F = scrapy.Field()
    pending = scrapy.Field()

We define fields for each of the attributes we want to capture from the websites. For the USLPlayerStat object shown in the example, we capture every attribute the USL website provides for player statistics.

3.1.1.3 spiderName.py

We define the spider in spiderName.py. Spiders are user-written classes used to scrape information from a domain (or a group of domains). The spider defines an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract the items defined in items.py.

To create a Spider, we must subclass scrapy.Spider and define the three main mandatory attributes.

● name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders. It will be the parameter we use to invoke this spider.

● start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.


● parse(): a method of the spider which will be called with the downloaded Response object of each start URL. The response is passed to the method as its first and only argument. This method is responsible for processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

A sample spider file is given as follows.

import json
import re
import string
import urlparse

import scrapy

from projectName.items import USLTeam, USLPlayerRos, USLPlayerStat  # adjust to your project's package name


class linksSpider(scrapy.Spider):
    name = 'usl'
    allowed_domains = ['uslpro.uslsoccer.com']
    start_urls = ['http://uslpro.uslsoccer.com/players/index_E.html',
                  'http://uslpro.uslsoccer.com/standings/index_E.html']

    def parse(self, response):
        # follow the per-team statistics links found on the start page
        table1 = response.xpath('//table//table//table//table//table//table')
        statslinks = table1.xpath("tbody/tr/td[last()]//a/@href")
        for url in statslinks:
            yield scrapy.Request(url.extract(), callback=self.parse)

        # extract the roster and statistics JS file URLs loaded by the team page
        pattern = re.escape("$j('div#indicator').fadeIn();") + '\s*url\s*=\s*(.*);'
        match = response.re(pattern)
        if len(match) >= 2:
            rosUrl = match[0]
            rosUrlStr = urlparse.urljoin(response.url, string.replace(rosUrl, "'", ""))
            statUrl = match[1]
            statUrlStr = urlparse.urljoin(response.url, string.replace(statUrl, "'", ""))
            yield scrapy.Request(rosUrlStr, callback=self.parse)
            yield scrapy.Request(statUrlStr, callback=self.parse)

        # team records from the Standings page
        table2 = response.xpath('//table//table//table//table//table//table')
        trs = table2.xpath("//tr[@class='tms']")
        attrs = ["OverallPts", "OverallGP", "OverallW", "OverallL", "OverallT", "OverallGF",
                 "OverallGA", "HomeW", "HomeL", "HomeT", "HomeGF", "HomeGA",
                 "AwayW", "AwayL", "AwayT", "AwayGF", "AwayGA"]
        for tr in trs:
            team = USLTeam()
            teamname = tr.xpath("td[@class = 'stub']/a/text()")[0].extract()
            teamname = string.replace(teamname, "\r\n", "")
            teamname = teamname.strip()
            team['TeamName'] = teamname
            i = 0
            for td in tr.xpath("td[not (@class)]/text()"):
                team[attrs[i]] = td.extract()
                i = i + 1
            yield team

        # player roster data fetched from the ros.js file
        if response.url.endswith('ros.js'):
            rosstr = response.xpath("//body/p/text()")[0].extract().replace("\r\n", "").replace("\n", "")
            rosplayers = json.loads(rosstr)
            for id, player in rosplayers["players"].items():
                rositem = USLPlayerRos()
                for key, value in player.items():
                    rositem[key] = value
                yield rositem

        # player statistics data fetched from the stat.js file
        if response.url.endswith('stat.js'):
            statstr = response.xpath("//body/p/text()")[0].extract().replace("\r\n", "").replace("\n", "")
            statplayers = json.loads(statstr)
            for id, player in statplayers["players"].items():
                statitem = USLPlayerStat()
                for key, value in player.items():
                    statitem[key] = value
                yield statitem

In the parse() function, the scrapy.http.Response object is the content of the website fetched by Scrapy.

There are several ways to extract data from web pages. Scrapy supports using XPath, CSS expressions, or regular expressions to extract data, by invoking the xpath(), css(), or re() method respectively. Usually, when we use the xpath(), css(), or re() method, the returned object is a selector. If we want to extract the data inside the object, we need to use the extract() method. For example, response.xpath('//ul/li/text()').extract() extracts the data strings inside the li tags.
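For instance, here is a small standalone illustration of xpath() and extract() using Scrapy's Selector on an inline HTML snippet (the snippet itself is made up for the example, not taken from the crawled sites).

from scrapy.selector import Selector

html = '<ul><li>USL PRO</li><li>MLS</li></ul>'
sel = Selector(text=html)

# xpath() returns a selector list; extract() turns it into plain strings
print(sel.xpath('//ul/li/text()').extract())   # ['USL PRO', 'MLS']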

In the example above, inside the parse function, we first extract the team website URLs contained in the start page. We use yield scrapy.Request(url.extract(), callback=self.parse) to add the URLs extracted from the start page to the list of URLs that Scrapy will crawl. The first part of the code is shown as follows.

table1 = response.xpath('//table//table//table//table//table//table')
statslinks = table1.xpath("tbody/tr/td[last()]//a/@href")
for url in statslinks:
    yield scrapy.Request(url.extract(), callback=self.parse)

USL fetches team data through separate requests for JS files and then loads that data into the team page when the page is rendered, so we further extract the URLs of the JS files invoked when loading the team page. Then we tell Scrapy to crawl these links by invoking the scrapy.Request() method. Here we use a regular expression to extract the URL links. The following code corresponds to this process.

pattern = re.escape("$j('div#indicator').fadeIn();") + '\s*url\s*=\s*(.*);'
match = response.re(pattern)
if len(match) >= 2:
    rosUrl = match[0]
    rosUrlStr = urlparse.urljoin(response.url, string.replace(rosUrl, "'", ""))
    statUrl = match[1]
    statUrlStr = urlparse.urljoin(response.url, string.replace(statUrl, "'", ""))
    yield scrapy.Request(rosUrlStr, callback=self.parse)
    yield scrapy.Request(statUrlStr, callback=self.parse)

Now we are ready to crawl the player data once the team roster JS file and the player statistics JS file are fetched and parsed by Scrapy. We use XPath to gather the JSON string inside each JS file, then use Python's json module to load the JSON string into a JSON object. We create the item object, namely USLPlayerRos or USLPlayerStat, for each player, assign values to the corresponding attributes, and yield the item object. The corresponding code is shown as follows.

if response.url.endswith('ros.js'):
    rosstr = response.xpath("//body/p/text()")[0].extract().replace("\r\n", "").replace("\n", "")
    rosplayers = json.loads(rosstr)
    for id, player in rosplayers["players"].items():
        rositem = USLPlayerRos()
        for key, value in player.items():
            rositem[key] = value
        yield rositem

if response.url.endswith('stat.js'):
    statstr = response.xpath("//body/p/text()")[0].extract().replace("\r\n", "").replace("\n", "")
    statplayers = json.loads(statstr)
    for id, player in statplayers["players"].items():
        statitem = USLPlayerStat()
        for key, value in player.items():
            statitem[key] = value
        yield statitem

For USL, the team information is stored in the Standings page at http://uslpro.uslsoccer.com/standings/index_E.html, so we also add this URL to our start_urls list. Extracting team information is quite easy compared to gathering player information on USL. We use XPath to gather each record of the table displayed in the Standings page. For each record, which represents a separate team, we create a team item object (USLTeam), assign values to the corresponding attributes, and finally yield the item object. Following is the code.

table2 = response.xpath('//table//table//table//table//table//table')
trs = table2.xpath("//tr[@class='tms']")
attrs = ["OverallPts", "OverallGP", "OverallW", "OverallL", "OverallT", "OverallGF",
         "OverallGA", "HomeW", "HomeL", "HomeT", "HomeGF", "HomeGA",
         "AwayW", "AwayL", "AwayT", "AwayGF", "AwayGA"]
for tr in trs:
    team = USLTeam()
    teamname = tr.xpath("td[@class = 'stub']/a/text()")[0].extract()
    teamname = string.replace(teamname, "\r\n", "")
    teamname = teamname.strip()
    team['TeamName'] = teamname
    i = 0
    for td in tr.xpath("td[not (@class)]/text()"):
        team[attrs[i]] = td.extract()
        i = i + 1
    yield team

3.1.1.4 Pipelines.py

Add the required pipeline to settings.py.

ITEM_PIPELINES = ['webcrawler.pipelines.WebcrawlerPipeline',]

The “yield” statement in the spider will invoke the pipeline. The pipeline passes the item to the data ingestion controller.

class WebcrawlerPipeline(object):
    def process_item(self, item, spider):
        name = type(item).__name__
        change_dict_key(item, name)

This identifies the name of the item to differentiate the source of the data. The name will be the name set in spiderName.py. We need the source of the data to identify duplicates from different websites.

3.1.2 BeautifulSoup

BeautifulSoup is used for crawling the following 2 websites:

http://www.topdrawersoccer.com/

http://www.mlssoccer.com/

3.1.2.1 Program Flow


Figure 1: Program workflow for Spider

3.1.2.2 Implementation

In this section, we give an example of the implementation of the MLS and TopDrawer crawlers, based on the soccer website:

http://www.mlssoccer.com/

To get the source code of a webpage, we use the urllib2 module to request the URL of the webpage:

req = urllib2.Request(url="http://www.mlssoccer.com/players?page=" + str(i), headers=headers)
try:
    handle = urllib2.urlopen(req).read()
except IOError, e:
    print "url open mistake."
    return True

After we get the source code of a web page, we can use the BeautifulSoup module to construct a BeautifulSoup object from the page source.

soup = BeautifulSoup(handle, "html5lib")

The BeautifulSoup object contains a tree of elements from the source HTML page. We can traverse the element tree to find the player data elements, parse the string values of those elements, and extract the player data.


for player in soup.find("table").find("tbody").find_all("tr"):
    attlist = player.find_all("td", recursive=False)
    Position = attlist[1].string.strip(' ').strip('\n').strip(' ')
    Player_Name = attlist[2].a.string.strip(' ')
    Club = attlist[3].string.strip(' ').strip('\n').strip(' ')
    Age = attlist[4].string.strip(' ').strip('\n').strip(' ')
    Heightin = attlist[5].string.strip(' ').strip('\n').strip(' ').strip("\"")
    ...
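A minimal, self-contained sketch that ties the snippets above together for one listing page is given below. The column positions follow the snippets above, but the function name, the headers content, the bs4 import, and the returned dictionary keys are assumptions made for illustration, not the project's actual code.

import urllib2
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 provides the BeautifulSoup class used above

def crawl_mls_page(i):
    # fetch one page of the MLS player listing (headers content is an assumption)
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(url="http://www.mlssoccer.com/players?page=" + str(i), headers=headers)
    try:
        handle = urllib2.urlopen(req).read()
    except IOError:
        print("url open mistake.")
        return []

    soup = BeautifulSoup(handle, "html5lib")
    players = []
    for player in soup.find("table").find("tbody").find_all("tr"):
        attlist = player.find_all("td", recursive=False)
        players.append({
            'position': attlist[1].string.strip(' ').strip('\n').strip(' '),
            'full_name': attlist[2].a.string.strip(' '),
            'club': attlist[3].string.strip(' ').strip('\n').strip(' '),
        })
    return players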

3.2 Facebook

3.2.1 Structure

3.2.1.1 Classes

Authorizer

● Description
Gets the access token for the Facebook component, which is required to access the public profiles of players.

● Attribute
FACEBOOK_APP_ID
The Facebook application ID for this application; it needs to be replaced with the SportechBI contractor’s application ID.
FACEBOOK_APP_SECRET
The Facebook application secret for this application; it needs to be replaced with the SportechBI contractor’s application secret.

● Method
Authorizer()
This function will create an Authorizer object.
string get_access_token()
This function will return the access_token generated from FACEBOOK_APP_ID and FACEBOOK_APP_SECRET.

SearchParser

● Description
Fetches the list of possible page_ids for a player by name.

● Attribute
counter
The index of the current entry in the list of page_ids in Facebook.
data
The list of page_ids in Facebook.

● Method
SearchParser(graph, name, team)
This function will create a SearchParser object. Parameter graph is the graph object returned by the Facebook API. Parameter name is the player’s name. Parameter team is the player’s team.
boolean isEmpty()
This function will return whether we have arrived at the last entry in the list of page_ids in Facebook. If we have reached the last entry, it returns True; otherwise, it returns False.
dict pop()
This function will return the next unvisited item in the list of page_ids in Facebook. The return type is the Python dict type. If we have accessed the last entry in the list, it returns None; otherwise, it returns the current entry.

VerifiedPageChecker

● Description
This class will check whether a page_id is a verified page_id in Facebook.

● Attribute
response
The dict object returned by the Facebook API.

● Method
VerifiedPageChecker(graph, pageid)
This function will create a VerifiedPageChecker object. Parameter graph is the graph object returned by the Facebook API. Parameter pageid is the pageid of a player.
boolean isVerified()
This function will return whether the page_id is verified.

correctPageChecker

● Description
This class handles the whole process of getting the correct id for an unverified player.

● Attribute
graph
The graph object returned by the Facebook API.
keyword
The keyword list used to check whether a player is a soccer player.
team
The player’s team.
response
The dict of data which contains the information of a player.

● Method
correctPageChecker(graph)
Parameter graph is the graph object returned by the Facebook API.
void addKeyword(keyword)
This function will add a specific keyword to the keyword list. Parameter keyword is the keyword; its data type should be string.
void removeKeyword(keyword)
This function will remove a specific keyword from the keyword list. Parameter keyword is the keyword; its data type should be string.
void load(pageid, team)
This function will refresh the response attribute of the object for the new pageid. Parameter pageid is the new pageid; its data type should be string. Parameter team is the team of the new pageid; its data type should be string.
boolean isCorrect()
This function will return whether the pageid passed to the object is the correct id for a player. The function uses the keywords in the keyword list to decide whether a pageid is correct. If some keywords appear in the about, description, personal_info, or username section of a player, then the pageid can be considered the correct pageid. For example, if a keyword in the keyword list is ‘soccer’ and this keyword appears in the description, then he/she must be a soccer player.

pageParser

● Description
This class will get the key attributes talking_about and likes of a player from the profile.

● Attribute
graph
The graph object returned by the Facebook API.
response
The dict of data which contains the public profile of a player.

● Method
pageParser(graph, pageid)
This method will create a new pageParser object. Parameter graph is the graph object returned by the Facebook API. Parameter pageid is the pageid of the player to get information from.
int getTalkingAbout()
This function will return the talking-about count for a player page.
int getLikes()
This function will return the number of likes for a player page.
void parse()
This function will return the talking-about count and the number of likes in a dict object. To access the talking-about data, use the key ‘talking_about_count’. To access the likes data, use the key ‘likes’.

postsParser

● Description
This class will handle the requests for retrieving all the posts within the last six months.

● Attribute
graph
The graph object returned by the Facebook API.
limit
The limit on the number of posts or comments requested from Facebook each time.
pageid
The pageid to be parsed.
response
The dict which contains all the information.
stamp
The Unix timestamp of six months ago.

● Method
postsParser(graph, pageid)
Creates a new postsParser object.
getNext()
This function will return the link to the next page of posts. If no such link exists, it returns None.
parse()
This function will parse the message of each post, the number of likes of the post, and the contents and number of all the comments under that post. The result is returned as a dict object.

postParser

● Description
This class will handle the requests for accessing the number of likes and comments, the content of each comment, and the content of the post for each postid.

● Attribute
graph
The graph object returned by the Facebook API.
limit
The limit on the number of posts or comments requested from Facebook each time.
postid
The postid to be parsed.
response
The dict which contains all the information.
stamp
The Unix timestamp of six months ago.

● Method
postParser(graph, postid)
Creates a new postParser object.
getNext()
This function will return the link to the next page of comments. If no such link exists, it returns None.
parse(last)
This function will parse the message of the post, the number of likes of the post, and the contents and number of all the comments under that post, and return them as a dict. Parameter last is a list, used to record whether we have reached the last required item. If last[0] is True, then this item is older than six months and we do not need to record it in the database.
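To show how the classes above are meant to fit together, here is a minimal sketch of the lookup flow, assuming the classes are importable from myFacebookv3 (the file named in the Q & A below) and using only the constructors and methods described in this section. The 'id' key of the dict returned by pop() and the GraphAPI construction are assumptions, not confirmed by this manual.

import facebook  # facebook-sdk
from myFacebookv3 import (Authorizer, SearchParser, VerifiedPageChecker,
                          correctPageChecker, pageParser, postsParser)

def find_page_id(graph, name, team):
    # walk the candidate page ids returned for the player's name
    search = SearchParser(graph, name, team)
    checker = correctPageChecker(graph)
    while not search.isEmpty():
        candidate = search.pop()
        if candidate is None:
            return None
        pageid = candidate['id']          # 'id' key is an assumption about pop()'s dict
        if VerifiedPageChecker(graph, pageid).isVerified():
            return pageid
        checker.load(pageid, team)
        if checker.isCorrect():
            return pageid
    return None

def crawl_player(name, team):
    token = Authorizer().get_access_token()
    graph = facebook.GraphAPI(token)      # graph object built from the facebook-sdk package
    pageid = find_page_id(graph, name, team)
    if pageid is None:
        return None
    page = pageParser(graph, pageid)
    page.parse()
    stats = {'likes': page.getLikes(), 'talking_about': page.getTalkingAbout()}
    posts = postsParser(graph, pageid).parse()  # posts and comments from the last six months
    return stats, posts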

3.2.2 Workflow

3.2.2.1 General Program Flow

Figure 2: General Program Flow for Facebook


3.2.2.2 postsParser.parse workflow


Figure 3: Post workflow

3.2.3 Q & A

● How do I change the limit on the number of requests for each query of comments or posts?
To change the limit for posts, go to the __init__ function in class postsParser and change the limit attribute self.limit to the number you want. The maximum number here is 250. To change the limit for comments, go to the __init__ function in class postParser and change the limit attribute self.limit to the number you want. The maximum permitted value for comments is not known.

● How do I customize the keyword list in correctPageChecker?
Add a line in the function __initKeywordList in class correctPageChecker, using self.keyword.append(keyword).

● How is the “get correct page id for a player” function implemented?
In a player’s profile, there are several fields we can use to tell whether a page id belongs to a football player: about, description, personal_info, and username. If any of these fields contains a keyword like ‘footballer’ or ‘soccer’, then we know that this page id belongs to a soccer player.

● How do I enable and disable the timestamp?
If a developer wants to disable the timestamp and retrieve all the data from a player’s public profile, comment out the source code in myFacebookv3.py at lines 203-204, 207-208, 252-253, and 302-304; the program will then be able to fetch all the historical data.

● What if I want to change the six-month time limit to some other period?
Go to the file myFacebookv3.py and change the timestamp calculation at lines 228~237.

3.3 Twitter

import sys – The sys module contains system-level information, such as the version of Python you are running (sys.version or sys.version_info), and system-level options such as the maximum allowed recursion depth (sys.getrecursionlimit() and sys.setrecursionlimit()). sys.modules is a dictionary containing all the modules that have been imported since Python was started; the key is the module name, the value is the module object. Note that this is more than just the modules your program has imported: Python preloads some modules on startup, and if you are using a Python IDE, sys.modules contains all the modules imported by all the programs you have run within the IDE. As new modules are imported, they are added to sys.modules. This explains why importing the same module twice is very fast: Python has already loaded and cached the module in sys.modules, so importing it a second time is simply a dictionary lookup.

import time – This module provides various time-related functions.

3.3.1 Structure

Authorizer – Gets the access token for the Twitter component, which is required to access the public profiles of players.

OAuth authentication to access Twitter –

Attribute
TWITTER_APP_KEY
The Twitter application key for this application; it needs to be replaced with the SportechBI contractor’s application key.
TWITTER_APP_SECRET
The Twitter application secret for this application; it needs to be replaced with the SportechBI contractor’s application secret.
TWITTER_ACCESS_TOKEN
The Twitter access token for this application; it needs to be replaced with the SportechBI contractor’s access token.
TWITTER_ACCESS_TOKEN_SECRET
The Twitter access token secret for this application; it needs to be replaced with the SportechBI contractor’s access token secret.

You need to register the application with Twitter to get access rights to crawl Twitter. Here is the link where one can register their own app to get access to Twitter: https://apps.twitter.com/. It is a simple step-by-step procedure, and you will get your own app key and access token. The four things needed here are the app key, app key secret, access token, and access token secret.

I have already registered an application in the name of STBI. Here are the four values for the app I registered:

TWITTER_APP_KEY = '8VQLr43DgYHK7TzMKU7Tjfvyk'
TWITTER_APP_KEY_SECRET = 'pFpmKQFNswqmTDzcgcoqjYS1sl4nNW8yvdbKL18nLxkkukiwcW'
TWITTER_ACCESS_TOKEN = '113962350-4dYWw6ypAejH9VAIKvGCleF36yyhroW41wVz1iRd'
TWITTER_ACCESS_TOKEN_SECRET = 'IZpkFS7ZWD0CZJkV2o1Pr2yqEwIb4AxiqeqnGNzNFAIme'

These can be used too but it is strongly recommended that you register your own app.

Here are a few screenshots that may help you in the registration process.


Figure 4: Twitter Snapshot

class TweetListener(StreamListener): A listener that handles tweets received from the stream. This is a basic listener that just prints received tweets to standard output.

tweepy.API(auth)


This function will return the access_token generated from TWITTER_APP_KEY and TWITTER_APP_KEY_SECRET.

twitterStream
Twitter Stream is a simple plugin designed to show a user’s Twitter timeline. It includes file caching to prevent overuse of Twitter’s API.

myusers[] – A list that stores all the screen names that have verified accounts.
otherusers[] – A list that stores all the screen names that have unverified accounts.
others[] – Contains all the possible users with a given screen name.
actors[] – Contains the list of all the users imported from the database.
count – A variable that maintains the count of all the verified accounts.
count1 – A variable that maintains the count of all the unverified accounts.

api.search_users – Provides a simple, relevance-based search interface to public user accounts on Twitter.

u.verified – Checks whether an account is verified for a user.

myusers.append(u.screen_name) – If we find a verified account, we add that screen name for that user to the myusers list.

others.append(u.screen_name) – All the screen names that are not verified for that username are stored in the others list.

user = api.get_user(player) – Defines the user to get tweets for; accepts input from the database.

t = user.followers_count – Stores the number of followers for a user in the variable t.

otherusers.append(myplayer) – Puts every user that does not have a verified account but has the maximum number of followers for a given username into the otherusers list.

for player in otherusers – A loop that goes through the otherusers list, i.e., the accounts that have the maximum number of followers but are unverified.

for player in myusers – A loop that goes through the myusers list, i.e., the accounts that are verified.

time.sleep(5) – Puts the api.search method to sleep for 5 seconds to respect the rate limit imposed by the API. We access each account every 5 seconds and get the details for that account.

user = api.get_user(player) – This function gets the user information for the given username and stores it in the variable user.

timeline = api.user_timeline(screen_name=user.screen_name, include_rts=True, count=1) – user_timeline is used to access the timeline of a user and get all the details from it. screen_name is the username. include_rts includes retweets if it is set to True. count gives the number of tweets to retrieve from the user's timeline. A maximum of 3200 tweets can be retrieved.

user.screen_name - print screen name

user.name - print username

user.id - print twitter unique ID

user.created_at - print the date account was created

user.description- print player description

user.followers_count- print number of followers

user.statuses_count - print number of tweets

user.url - print the player's website URL if specified

tweet.id- print unique tweet ID

tweet.text- print the text of the tweet

tweet.created_at- print the date and time when the tweet was created

TM_T02_F14a_V2.0 21 Date: 12/08/14

Page 26: greenbay.usc.edu · Web viewfirst_name text, middle_name text, last_name text, full_name text NOT NULL,gender VARCHAR(6),source text NOT NULL, team_name text,division text,conference

Technical Manual Version 2.0

tweet.retweeted - print whether the tweet is a retweet (returns 1 or 0)

tweet.retweet_count- print retweet count

tweet.source - print the source from which the tweet was posted

tweet.favorited - print the favorite count
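Putting these pieces together, here is a minimal sketch of the lookup flow described above, using only Tweepy calls mentioned in this section. The function name and the verified/unverified selection details are simplifications and assumptions, not the project's actual script.

import time
import tweepy

# OAuth credentials are the constants shown in the registration step above
auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_KEY_SECRET)
auth.set_access_token(TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

def fetch_latest_tweet(player_name):
    # search public accounts for the player's name and prefer a verified one
    candidates = api.search_users(player_name)
    if not candidates:
        return None
    verified = [u for u in candidates if u.verified]
    if verified:
        chosen = verified[0]
    else:
        # fall back to the unverified account with the most followers
        chosen = max(candidates, key=lambda u: u.followers_count)

    time.sleep(5)  # pause between calls to stay under the API rate limits
    user = api.get_user(screen_name=chosen.screen_name)
    timeline = api.user_timeline(screen_name=user.screen_name, include_rts=True, count=1)
    return user, timeline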

3.3.2 Workflow

3.3.2.1 General Program Flow

Figure 5: General Program flow for Twitter


3.3.2.2 Get Function workflow

Figure 6: Get function workflow

3.3.3 Q & A

What is the limit on the number of users that can be searched using the API?

Search will be rate limited at 180 queries per 15-minute window. A friendly reminder that search queries will need to be authenticated in version 1.1.

How do I change the limit on the number of tweets?


To change limits for tweets, go to api.user_timeline(screen_name=user.screen_name, include_rts=True, count=N) and change the count attribute to the number you want. The maximum number here is 3200.

Where can I check the rate limit for various GET functions of the API?

Go to https://dev.twitter.com/rest/public/rate-limits

3.4 Data Ingestion

Tables:

Player - holds details of the players identified from various websites. There are columns that correspond to the fields available in the websites. If a new website has to be crawled, its corresponding fields should be added to the database schema.

first_name text, middle_name text, last_name text, full_name text NOT NULL,gender VARCHAR(6),source text NOT NULL, team_name text,division text,conference text, goals NUMERIC, height REAL, heightin text, weight NUMERIC, yellow NUMERIC, red NUMERIC, assist NUMERIC, duplicate NUMERIC, date_of_birth text, age NUMERIC, seasons text, positions text,position_1 text,sequence_val text, person_key text, jersey text, register_num NUMERIC, num NUMERIC, last_team text, approved text, nick_name text, games_played NUMERIC, game_starts NUMERIC, sub_ins NUMERIC, points NUMERIC, shots NUMERIC,shots_on_target NUMERIC,foul_committed NUMERIC, foul_suffered NUMERIC, fouls NUMERIC, pending text, minutes NUMERIC, saves NUMERIC, goals_conceded NUMERIC, country text, win NUMERIC, lose NUMERIC, draw NUMERIC, status text,r_shots_on_goals NUMERIC,

Team - holds team details identified from various websites. There are columns that correspond to the fields available in the websites. If a new website has to be crawled, its corresponding fields should be added to the database schema.

team_name text NOT NULL, goals NUMERIC, fouls NUMERIC, yellow NUMERIC, red NUMERIC, source text NOT NULL, duplicate NUMERIC, games_played NUMERIC, shots NUMERIC, points NUMERIC, wins NUMERIC, losses NUMERIC, ties NUMERIC, games_for NUMERIC, games_against NUMERIC, home_wins NUMERIC, home_losses NUMERIC, home_ties NUMERIC, home_games_for NUMERIC, home_games_against NUMERIC, away_wins NUMERIC, away_losses NUMERIC, away_ties NUMERIC, away_games_for NUMERIC, away_games_against NUMERIC, team_id text, tournament text, home_games_played NUMERIC, home_points NUMERIC, away_games_played NUMERIC, away_points NUMERIC, goal_difference NUMERIC, home_goal_difference NUMERIC, away_goal_difference NUMERIC, streaks NUMERIC, home_streaks NUMERIC, away_streaks NUMERIC, r_games_played NUMERIC, r_goals NUMERIC, r_assists NUMERIC, r_shots NUMERIC, r_shots_on_goals NUMERIC, r_fouls_committed NUMERIC, r_fouls_suffered NUMERIC, r_offsides NUMERIC, r_corner_kick NUMERIC, r_penality_kick_goals NUMERIC, r_penality_kick_attempts NUMERIC, p_games_played NUMERIC, p_goals NUMERIC, p_assists NUMERIC, p_shots NUMERIC, p_shots_on_goals NUMERIC, p_fouls_committed NUMERIC, p_fouls_suffered NUMERIC, p_offsides NUMERIC, p_corner_kick NUMERIC, p_penality_kick_goals NUMERIC, p_penality_kick_attempts NUMERIC, Largest_Attendance text, Lowest_Attendance text, Most_Home_Goals text, Most_Away_Goals text, Largest_Margin_of_Victory text, Average_Attendance text, Aggregated_Attendance text, Longest_Losing_Streak text, Longest_Unbeaten_Streak text, Longest_Winless_Streak text, Longest_Winning_Streak text, scorer text[], assister text[], discs text[], players text[], year NUMERIC

Facebook - The facebook data is stored in two tables with the post id relating the tables,

The player’s basic facebook details are stored in facebook table.

full_name text, team text, facebook_id text, likes NUMERIC, talking_about NUMERIC, post_count NUMERIC, post_ids text[]

The data corresponding to each post are stored in facebook_post table.

post_id text, Status text, likes numeric, comment_count NUMERIC, comments text[]

Twitter - The twitter data is in two tables with full name relating the tables.

The twitter table stores the basic Twitter information.

full_name text, screen_name text, twitter_id text,created date, description text, num_of_followers numeric, num_of_tweets text, website_URL text

The twitter_tweet table stores information about each tweet

full_name text, tweet_id text, tweet text, created date, retweets text, retweet_count text, tweet_source text

3.4.1 Connect to Database

The PostgreSQL connection requires a database name, username, password, and host. There is a config file that holds these values. To connect to a different database, change the details in this file.

[DbInfo]

database_name = webscraping

host = localhost


user = postgres

password = postgres

port = 5432

Python's ConfigParser library helps us open this config file and read the connection details.

CONFIG_FILE = 'C:\Users\Suba\workspace\webcrawler_pro\webcrawler\Scheduler\config.cfg'

DB_INFO_SECTION = 'DbInfo'

config = ConfigParser.ConfigParser()

config.read(CONFIG_FILE)

host = config.get(DB_INFO_SECTION, 'host')

database_name = config.get(DB_INFO_SECTION, 'database_name')

user = config.get(DB_INFO_SECTION, 'user')

password = config.get(DB_INFO_SECTION, 'password')

conn_string = "host=\'"+ host + "\' dbname=\'" + database_name +'\' user=\'' + user + '\' password=\''+ password + '\''

Use the psycopg2 library to establish connection to the database.

conn = psycopg2.connect(conn_string)
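The duplicate-handling code in section 3.4.3 reads result rows by column name (for example row['source']), which requires a dictionary-style cursor. The following is only a sketch of how such cursors could be obtained with psycopg2's extras module; it is an assumption about how the real cursors are created.

import psycopg2.extras

# rows come back as dictionaries keyed by column name, e.g. row['source']
cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
update_cursor = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)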

3.4.2 Create custom dictionary

We create a custom dictionary so that the column names of the database match the keys of the Python dictionary. This way, psycopg2 lets us insert the dictionary into the respective columns of the database.

custom_dictionary = {

'Yel': 'yellow',

'Ast': 'assist',

'Player': 'full_name',

'Name': 'full_name',


'Born': 'date_of_birth',

'DateofBirth': 'date_of_birth',

'Country': 'country',

'Age': 'age',

'Height': 'height',

'Weight': 'weight',

}

If there is a ‘Yel’ key in the Python dictionary that resulted from crawling the websites, that key will be changed to ‘yellow’. So if the database schema is changed, this file has to be changed to map the columns to the corresponding data.
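The pipeline in section 3.1.1.4 calls change_dict_key(item, name) to perform this remapping. The function itself is not reproduced in this manual, so the following is only a sketch of what it could look like, assuming it also records the item class name as the 'source' field used later for duplicate detection; the real function may instead modify the item in place.

def change_dict_key(item, name):
    # sketch only: rename crawled keys to database column names and tag the source
    row_data = {}
    for key, value in dict(item).items():
        row_data[custom_dictionary.get(key, key)] = value
    row_data['source'] = name
    return row_data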

3.4.3 Identify duplicates

There is a chance that the same player is present on different websites. Since the websites only share the name as a common field, we check whether a player with the same name already exists in the database. If such a player exists, the records are flagged as duplicates. SporTech B.I contractors will manually check the flagged records, correlate them with the websites, and then decide whether the data are really duplicates or belong to different players who happen to have the same name.

cursor.execute("SELECT * FROM PLAYER WHERE FULL_NAME = '"+ row_data['full_name']+"'")

duplicate_rows = cursor.fetchall()

if cursor.rowcount <= 0:

insert_into_db(row_data)

else:

for row in duplicate_rows:

if row['source'] == row_data['source']:

update_cursor.execute("delete from player where source ='" + row['source']+"' and full_name ='"+ row_data['full_name'] +"'")

insert_into_db(row_data)

elif row['source'] == 'USLPlayerStat' and row_data['source'] == 'USLPlayerRos':

update_cursor.execute("update player set last_team = %s , height = %s , weight = %s ,approved = %s where source = %s and full_name = %s" ,

TM_T02_F14a_V2.0 27 Date: 12/08/14

Page 32: greenbay.usc.edu · Web viewfirst_name text, middle_name text, last_name text, full_name text NOT NULL,gender VARCHAR(6),source text NOT NULL, team_name text,division text,conference

Technical Manual Version 2.0

(row_data['last_team'] , row_data['height'], row_data['weight'], row_data ['approved'], row['source'], row['full_name'] ))

elif row['source'] == 'USLPlayerRos' and row_data['source'] == 'USLPlayerStat':

update_cursor.execute('update player set nick_name = %s , games_played = %s , goals = %s , points = %s , minutes = %s, fouls =%s , shots= %s, assist = %s, pending= %s where source = %s and full_name = %s',

(row_data['nick_name'], row_data['games_played'], row_data['goals'], row_data ['points'], row_data['minutes'], row_data['fouls'], row_data['shots'], row_data['assist'], row_data['pending'], row['source'], row_data['full_name'] ) )

else:

row_data['duplicate'] = '1'

The USLPro website has a special case, where the same player's data is present on two different pages of the website (player statistics and roster). We can differentiate between the statistics and roster data based on the spider name. This scenario alone is handled as an update to the same name, because we know that the website presents the same player's data on two different pages.

The same is done for team details too. The team name is used as key to check duplicates in that scenario.

3.4.4 Insert into Database

We can identify the columns to be populated for each website based on the Python dictionary created as a result of crawling. From this dictionary we dynamically create the query string. The source (spider name), which is a key in the dictionary, is used to identify the source. Based on the source, we insert the data into the corresponding tables.

row_data is the python dictionary.

if row_data['source'] in player_spider_names:
    column_name = ''   # both strings start empty and are built from the dictionary keys
    column_data = ''
    for key in row_data:
        column_data += '%(' + key + ')s'
        column_name += key + ','
        column_data += ','
    column_data = column_data[0:-1]
    column_name = column_name[0:-1]
    cursor.execute("INSERT INTO PLAYER(" + column_name + ") VALUES (" + column_data + ")", row_data)


3.4.5 Developer User Interface

This allows the developer to schedule tasks. A task invokes a crawler at a specified time. There is also a provision to run a particular task immediately.

From the terminal, open the folder that contains the website code and enter the command "rails s". This starts a small web server named WEBrick, which comes bundled with Ruby. For more information about the rails command, please visit http://guides.rubyonrails.org/command_line.html. Then, in your browser, go to the URL http://localhost:3000/tasks. This will open the Tasks page.

There are two tabs in the UI: Tasks and Players. Here, localhost can be replaced with the URL of the server the application is deployed to.

3.4.5.1 Tasks Tab

The following are the fields in the Tasks tab.

Name - Name defined for a specific task

Frequency - The frequency at which the crawler should run. It can hold values Hourly, Daily, Monthly, Weekly.

Day of the week - If the frequency is weekly, the day of the week at which it should be scheduled.

Day of the month - If the frequency is monthly, the day of the month at which it should be scheduled.

Hour - The hour at which it has to be scheduled.

Command - The command to be executed by the scheduler. It should be of the form ‘python folder_path to .py file’. We are getting this as an input so that if the folder structure/file name changes there will be no code level changes.

Folder - The folder to which the scheduler should change before executing the command. It is an optional attribute; it is mandatory only for crawlers that use Scrapy, because Scrapy executes only from within its project framework.

Show - Displays a detailed view of the selected Task.

Edit - Gives the option to edit the selected task.

Delete - deletes the selected task.

Run now - Runs the selected task now in addition to its scheduled time.

New Task - Creates a new task based on the inputs given by the SporTech B.I developers. The task will be ingested into the database. A task id is created internally by the database.

3.4.5.2 Players Tab

This tab holds the player names and details about whether the crawler should fetch their Facebook or Twitter data.

Name - The name of the player.

Facebook - This is set to true if the player's Facebook data has to be crawled; it is set to false otherwise.

Twitter - This is set to true if the player's Twitter data has to be crawled; it is set to false otherwise.

Show - Shows a separate view for the selected player.

Edit - Gives the SporTech B.I developers the option to edit details of an existing player.

Delete - Deletes the selected player.

New Player - Creates a new player record based on the inputs given by the SporTech B.I developers. This data is stored in the database and used when the Facebook and Twitter crawlers are executed.

3.5 Scheduler

This component invokes the required scripts at the scheduled time. The time/schedule is defined by the SporTech B.I developers via the developer UI.

The scheduler first checks whether it was invoked for an individual task via the ‘Run now’ option in the developer UI or as the general scheduled run. If it was an individual task, the scheduler executes that task alone; otherwise, it executes all tasks that are due.

The task id is passed as a parameter from the developer UI if the ‘Run Now’ button was clicked.

if (task_id is not None):
    run_by_task_id(job_run_dao, task_dao, task_id)
else:
    run_all(job_run_dao, task_dao)

The run_by_task_id method selects an individual task from the database and executes that task alone.

tasks = task_dao.load_by_id(task_id)
if len(tasks) == 0:
    Util.error("There not tasks with id of : " + task_id)
    return
run(job_run_dao, tasks)


The run_all method executes all the tasks from the database that are due, based on time.

tasks = task_dao.select_all()

tasks_to_run = tasks

tasks_to_run = []

tasks_to_run.extend(get_hourly_tasks(tasks))

tasks_to_run.extend(get_daily_tasks(tasks))

tasks_to_run.extend(get_weekly_tasks(tasks))

tasks_to_run.extend(get_montly_tasks(tasks))

print (tasks_to_run)

run(job_run_dao, tasks_to_run)

The run method executes the required scripts.

for task in tasks:
    job_run_id = start_job_run(job_run_dao, task)
    t = threading.Thread(target=do_run, args=(job_run_dao, task, job_run_id,))
    t.daemon = False
    t.start()

At the end of the task, we determine the status of the job and then log it in the database.

Util.debug("ending job: " + str(job_run_id) + " status: " + job_status)

update_query = "UPDATE JOB_RUN SET END_TIME = %s, STATUS = %s WHERE JOB_RUN_ID = %s"

cur = self.connection.cursor()

result = cur.execute(update_query, (psycopg2.TimestampFromTicks(time.time()), job_status, job_run_id))

self.connection.commit()
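do_run, the thread target, is not reproduced in this manual. The following is only a sketch of how it might tie the pieces above together, assuming each task record exposes the Command and Folder fields from the Tasks tab and that the JOB_RUN update shown above is wrapped in a DAO method; the field access style and the DAO method name are assumptions.

import subprocess

def do_run(job_run_dao, task, job_run_id):
    # sketch only: run the task's command from its working folder and record the outcome
    proc = subprocess.Popen(task['command'], cwd=task['folder'] or None, shell=True)
    proc.wait()
    job_status = 'SUCCESS' if proc.returncode == 0 else 'FAILED'
    job_run_dao.end_job_run(job_run_id, job_status)  # hypothetical wrapper around the UPDATE shown above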
