CS 111: Program Design I Lecture 20: Web crawling, HTML, Copyright Robert H. Sloan & Richard Warner University of Illinois at Chicago November 8, 2016


  • CS 111: Program Design I Lecture 20: Web crawling, HTML, Copyright

    Robert H. Sloan & Richard Warner University of Illinois at Chicago November 8, 2016

  • Most important thing

    n  If you have not yet voted (and are a US citizen), GO VOTE!

  • WEB CRAWLER AGAIN

  • Two bits of useful Python syntax

    n  Don't need either one for the web crawler, but they will make it a bit prettier:
        q  break
        q  list as condition for if or while

  • Break

    n  Causes immediate termination of the loop
    n  Typical usage:
        # We are inside a while or for
        if <condition>:
            break

    n  Use carefully! It can make code very hard to read

  • Which of these will exit when x is initially 9?

    A.
        while (x%2 == 1 and x%3 == 0):
            x = 9

    B.
        while True:
            if (x%2 == 1 and x%3 == 0):
                break
            x = 9

    C. Both
    D. Neither
    E. I don’t know
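One way to check an answer to the question above is to run both loops. The sketch below is only an illustration: the function names `loop_a` and `loop_b` and the `max_steps` safety cap are assumptions added here (they are not on the slide), and the cap exists only so that a loop that would never exit still returns.

```python
def loop_a(x, max_steps=5):
    """Loop A: the while condition is a 'keep going' test."""
    steps = 0
    while (x % 2 == 1 and x % 3 == 0):
        x = 9
        steps += 1
        if steps >= max_steps:  # safety cap, not part of the slide
            return "still looping"
    return "exited"


def loop_b(x, max_steps=5):
    """Loop B: the if condition is a 'stop now' test, acted on by break."""
    steps = 0
    while True:
        if (x % 2 == 1 and x % 3 == 0):
            break
        x = 9
        steps += 1
        if steps >= max_steps:  # safety cap, not part of the slide
            return "still looping"
    return "exited"


print(loop_a(9))  # still looping
print(loop_b(9))  # exited
```

The same Boolean expression means opposite things in the two versions: after `while` it says when to continue, and before `break` it says when to stop.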

  • Idiomatic Python: Empty list check

    n  If you need to check whether a list is empty or nonempty:
        q  Python, in a Boolean context after if or while, treats an empty list as False and any other list as True
    n  So if I want to keep processing ls as long as it's nonempty:
        while ls:
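A minimal sketch of the idiom, assuming a made-up list of page names (the values and the `processed` list are hypothetical, chosen only to have something to loop over):

```python
ls = ["a.html", "b.html", "c.html"]
processed = []

# An empty list counts as False, so the loop ends once every item is popped.
while ls:
    item = ls.pop()  # pop() removes and returns the last element
    processed.append(item)

print(processed)            # ['c.html', 'b.html', 'a.html']
print(bool([]), bool([0]))  # False True
```

Note that `[0]` is still True in a Boolean context: only the length of the list matters, not what it contains.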

  • Crawl all pages reachable from start

    n  List of pages to visit, initially start
    n  while that list is not empty:
        q  Take a page from the list
        q  Get its text # need to learn how to do this
        q  Remove that page from the to-visit list, add it to the already-visited list
        q  Get all the links in that page
        q  for each link
            n  if not already in visited list
            n  add it to the to-visit list

  • def crawl(start, limit):

        to_visit = [start]
        visited = []
        while to_visit:
            address = to_visit.pop()
            if address not in visited:
                content = text @ URL address  # (Lect. 18)
                do_the_visit(page, emails, etc.)
                visited.append(address)
            if len(visited) >= limit:
                break
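The outline above leaves the page fetch and the visit step as pseudocode. Below is a runnable sketch under stated assumptions: `FAKE_WEB`, `get_text`, and `get_links` are hypothetical stand-ins invented for this example so that it runs without a network connection, and the link finder is deliberately crude.

```python
import re

# Hypothetical stand-in for the web: maps an address to that page's text.
FAKE_WEB = {
    "a": 'links: href="b" href="c"',
    "b": 'links: href="a" href="d"',
    "c": "no links here",
    "d": 'links: href="c"',
}


def get_text(address):
    """Placeholder for fetching a page; a real crawler would use urllib."""
    return FAKE_WEB.get(address, "")


def get_links(content):
    """Very rough link finder: every value that follows href=."""
    return re.findall(r'href="([^"]*)"', content)


def crawl(start, limit):
    to_visit = [start]
    visited = []
    while to_visit:
        address = to_visit.pop()
        if address not in visited:
            content = get_text(address)
            visited.append(address)
            # Queue every link we have not visited yet.
            for link in get_links(content):
                if link not in visited:
                    to_visit.append(link)
        if len(visited) >= limit:
            break
    return visited


print(crawl("a", 10))  # ['a', 'c', 'b', 'd']
print(crawl("a", 2))   # ['a', 'c']
```

Because `pop()` takes from the end of the list, this version explores the most recently found link first; the pages "a" and "b" link to each other, and the `visited` check is what keeps the crawl from going in circles.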

  • Crawl and Scrape!

    n  Crawlers crawl for a purpose
    n  Our assignment: Grab email addresses
    n  Could just as easily grab all .jpg or all .mp3 or all … files
    n  Or, a search engine:
        q  Build a dictionary showing which words/phrases show up on which web pages
    n  Or ...
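As a rough illustration of two of these purposes, the sketch below grabs email addresses and builds a word-to-pages dictionary from page text. The `pages` data is made up, and the email pattern is a simplified assumption that will miss some valid addresses; it is not the pattern required by the assignment.

```python
import re

# Hypothetical page texts keyed by address (made up for this example).
pages = {
    "page1": "Contact alice@example.com or bob@example.org about the web crawler.",
    "page2": "The crawler lecture notes. Questions: alice@example.com",
}

# Simplified email pattern; real-world addresses can be more varied.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

emails = set()
index = {}  # word -> set of pages on which it appears
for address, text in pages.items():
    emails.update(EMAIL.findall(text))
    for word in re.findall(r"[a-z]+", text.lower()):
        index.setdefault(word, set()).add(address)

print(sorted(emails))            # ['alice@example.com', 'bob@example.org']
print(sorted(index["crawler"]))  # ['page1', 'page2']
```

Using a set for the emails means an address that appears on several pages is recorded only once.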

  • A LITTLE ABOUT WHAT'S ON A WEBPAGE: HTML

  • HTML is a markup language

    n  That has evolved
    n  Was simple c. 1996, and pretty simple c. 2005
        q  Now, people want to control the look-and-feel of the page down to the pixels and fonts.
        q  Plus, we want to grab information more easily out of Web pages.
    n  Leading to XML, the eXtensible Markup Language.
    n  XML allows for new kinds of markup languages (that, say, explicitly identify prices or stock ticker codes) for business purposes.

  • Three kinds of HTML languages

    n  Original HTML: Simple, what the earliest browsers understood.
    n  CSS, Cascading Style Sheets
        q  Ways of defining more of the formatting instructions than HTML allowed.
    n  XHTML: HTML re-defined in terms of XML.
        q  A little more complicated to use, but more standardized, more flexible, more powerful.
        q  It's (probably?) future of where the Web is going.

  • Markup means adding tags

    n  A markup language adds tags to regular text to identify its parts.
    n  Tags in HTML are enclosed by < and >.
    n  Most tags have a starting tag and an ending tag.
        q  A paragraph is identified by a <p> at its start and a </p> at its end.
        q  A heading is identified by a <h1> at its start and a </h1> at its end.

  • HTML is just text in a file

    n  Enter text and tags in just a plain ole ordinary text file.
    n  Use an extension of “.html” (“.htm” if your computer only allows three characters) to indicate HTML.
    n  VS Code works just fine for editing and saving HTML files.
        q  Just don't try to load them in terminal!

  • Parts of a Web Page

    n  Start with a DOCTYPE
        q  It tells browsers what kind of language you're using below.
    n  Whole document is enclosed in <html> </html> tags.
        q  The heading is enclosed with <head> </head>
            n  That's where you put the <title> </title>
        q  The body is enclosed with <body> </body>
            n  That's where you put headings and paragraphs.
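To see this nesting from a program's point of view, here is a small sketch that feeds a minimal page to Python's standard `html.parser` module and records each opening tag with its depth. The page text, the `<!DOCTYPE html>` line, and the class name `OutlinePrinter` are assumptions made for this example; the slide does not say which DOCTYPE it had in mind.

```python
from html.parser import HTMLParser

# A made-up minimal page: DOCTYPE, then html enclosing head and body.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Example title</title>
</head>
<body>
<h1>Example heading</h1>
<p>Example paragraph.</p>
</body>
</html>"""


class OutlinePrinter(HTMLParser):
    """Records each start tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_decl(self, decl):
        # Called for the <!DOCTYPE ...> declaration.
        self.outline.append("<!" + decl + ">")

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlinePrinter()
parser.feed(PAGE)
print("\n".join(parser.outline))
```

The printed outline shows `head` and `body` one level inside `html`, with `title` inside the head and the heading and paragraph inside the body.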

  • Other things in there

    n  We're simplifying these tags a bit.
    n  More can go in the <head>
        q  Javascript (tons of it)
        q  References to documents like cascading style sheets
    n  The <body> tag can also, e.g., set colors.

  • HTML is not a programming language

    n  Using HTML is called “coding” and it is about getting your codes right.

    n  But it's not about coding programs.
    n  HTML has no
        q  Loops
        q  IFs
        q  Variables
        q  Data types
        q  Ability to read and write files

    n  Bottom line: HTML does not communicate process!

  • Can use programs to generate HTML

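    def make_page():
        # Write a complete (very small) web page to generated.html
        file = open("generated.html", "w")
        file.write("""<html>
    <head>
    <title>The Simplest Possible Web Page</title>
    </head>
    <body>
    <h1>A Simple Heading</h1>
    <p>Some simple text.</p>
    </body>
    </html>
    """)
        file.close()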

  • How do Amazon and Ebay work?

    n  By generating a catalog as Web pages.
    n  Nobody types up all those product HTML pages.
    n  Rather, the data is in a database, and there are programs that put the data into a database and programs that use the database to generate web pages in HTML.
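A toy version of the idea, assuming a hypothetical list of dictionaries in place of a real database (the product names, prices, and the function name `product_page` are all invented for this sketch):

```python
import html

# Hypothetical catalog rows; a real site would read these from a database.
products = [
    {"name": "Blue Mug", "price": 7.5},
    {"name": "Tea & Biscuits", "price": 4.25},
]


def product_page(product):
    """Return the HTML for one product as a string."""
    # Escape the name so characters like & or < in the data cannot break the markup.
    name = html.escape(product["name"])
    return (
        "<html>\n"
        "<head><title>" + name + "</title></head>\n"
        "<body>\n"
        "<h1>" + name + "</h1>\n"
        "<p>Price: $" + format(product["price"], ".2f") + "</p>\n"
        "</body>\n"
        "</html>\n"
    )


pages = [product_page(p) for p in products]
print(pages[1])
```

One function plus one row of data gives one page, so a catalog of any size costs no extra typing; this is the loop and the variables that HTML by itself does not have.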

  • OPEN ACCESS: COPYRIGHT AND ACCESS CONTROL

  • Web Crawlers and Copyright

    n  “Google, like other search engines, uses an automated program (called the ‘Googlebot’) to continuously crawl across the Internet, to locate and analyze available Web pages, and to catalog those Web pages into Google’s searchable Web index. As part of this process, Google makes and analyzes a copy of each Web page that it finds, and stores the HTML code from those pages in a temporary repository called a cache.” Field v. Google.

    n  When does making the copy infringe copyright?

  • Python Web Crawler Code

    n  import urllib.request as ur
    n  connection = ur.urlopen("https://www.cs.uic.edu")
        q  # Accesses the website
        q  # Issues about unauthorized access, CFAA
    n  content = connection.read()
        q  # Copies the data into the computer’s memory.
        q  # Issues about copyright, copyright statute.
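One practical detail: `read()` returns bytes, not a string, so a crawler usually decodes the result before searching it. The sketch below wraps the two calls in a helper; the name `fetch_text` and the fallback to UTF-8 when the server names no character set are assumptions of this example, not part of the lecture code.

```python
import urllib.request as ur


def fetch_text(url):
    """Return the text at url, decoded from bytes to str."""
    with ur.urlopen(url) as connection:  # accesses the site
        raw = connection.read()          # copies the data into memory (bytes)
        # Use the character set the server reports, or assume UTF-8.
        charset = connection.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs a network connection, so it is left commented out):
# content = fetch_text("https://www.cs.uic.edu")
```

The `with` block closes the connection as soon as the copy has been made.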

  • Copyright: A Bundle of Rights

    n  The right to
        q  make copies and distribute copies of the work.
        q  make a derivative work.
        q  publicly display the work
        q  publicly perform the work
    n  17 U. S. C. §106.

  • Copying Code

    n  Suppose you copy some Python code off a website, and then use it in your homework. You
        a.  Create a derivative work.
        b.  Make a copy and create a derivative work.
        c.  Make a copy

  • How Is The Right Created?

    n  Copyright exists when you create
        q  an original work of authorship
        q  fixed in a tangible medium of expression.

    n  A work is “fixed” in a tangible medium of expression when its embodiment in a copy or phonorecord, by or under the authority of the author, is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.

    n  17 U. S. C. §101.

  • Why Have Copyright?

    n  To promote progress in the arts and sciences.

    n  Assumptions:
        q  We want enough progress in the arts and sciences.
        q  We won’t have enough unless authors can get paid for their works.
        q  They won’t make enough money if people can copy their works for free.

  • Copyright Law A Good Fit?

    n  Is this a good argument?
        q  We won’t have enough works unless authors can get paid for their works
        q  They won’t make enough money if people can copy their works for free.
    n  In general, is traditional copyright a good fit for the digital/Internet world?

  • The Need for Control Without Law

    n  Individuals “must share some general sense that most of the rules governing the situation are appropriate. Otherwise, the cost of enforcement . . . becomes high enough that it is difficult, if not impossible, to maintain predictability.”
        q  Elinor Ostrom, Institutional Diversity.

  • Non-Legal Norms

    n  Norms can encapsulate a shared general sense of appropriate behavior.
        q  Think of the family holiday dinner.
    n  Web crawler norms control access by search programs to online information:
        q  The norms create a way of opening or closing the door.

  • Opening and Closing: robots.txt

    User-agent: * [Or bot names]
    Disallow: /
    --Allows nothing

    User-agent: *
    Disallow: [blank]
    --Allows all

    User-agent: *
    Disallow: [one or more, each followed by a url]
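A crawler has to choose to check these rules; nothing in the file enforces them. The sketch below uses Python's standard `urllib.robotparser` module to test a URL against the three kinds of file shown above. The helper name `allowed`, the example.com URLs, and the `/private/` and `/tmp/` paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="*"):
    """Return True if the robots.txt text lets this agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no download
    return parser.can_fetch(agent, url)


allow_nothing = "User-agent: *\nDisallow: /"
allow_all = "User-agent: *\nDisallow:"
allow_some = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/"

url = "https://example.com/private/notes.html"
print(allowed(allow_nothing, url))  # False
print(allowed(allow_all, url))      # True
print(allowed(allow_some, url))     # False
print(allowed(allow_some, "https://example.com/index.html"))  # True
```

In a real crawler the text would come from the site's own `/robots.txt` file, and the check would run before each page is fetched.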

  • Another Way To Control “The Door”

    . . .

    n  NOINDEX - prevents indexing.
    n  NOFOLLOW - prevents following links.
    n  NOARCHIVE - cached copy not available in search results.
    n  NOSNIPPET - prevents caching and description in search results.
    n  NONE - equivalent to "NOINDEX, NOFOLLOW".
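These values are normally written in a page's head as a robots meta tag of the form `<meta name="robots" content="...">`. A crawler that wants to honor them could look for that tag before following links, roughly as sketched below. The class name `RobotsMetaParser`, the helper `may_follow_links`, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values listed in robots meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated list of values.
            for word in (attrs.get("content") or "").split(","):
                if word.strip():
                    self.directives.add(word.strip().upper())


def may_follow_links(page_html):
    """Return False if the page asks crawlers not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(page_html)
    return not ({"NOFOLLOW", "NONE"} & parser.directives)


page = ('<html><head><meta name="robots" content="noindex, nofollow">'
        "</head><body></body></html>")
print(may_follow_links(page))                          # False
print(may_follow_links("<html><head></head></html>"))  # True
```

As with robots.txt, the tag is only a request: it works because crawlers are written to look for it.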

  • Coordination Norms

    n  The use of robots.txt and metatags involves coordination norms.

    n  This is worth a bit more detail.

  • A Coordination Norm

    n  A regularity:
        q  Web crawlers (a lot of them) obey instructions in robots.txt files and metatags.
    n  Conditional conformity:
        q  Web crawlers conform in part because (and only as long as) they think other web crawlers will conform.

    n  Elevators and driving on the right.

  • The Elevator Norm

    n  You enter an elevator with two people in it.

    n  Where do you stand? (FRONT = UP)

    a b c

  • The Nearest Neighbor Norm

    n  You maximize the distance from your nearest neighbor (while maintaining peripheral eye contact with at least one other person).

    Three person distribution

  • Copyright Law A Good Fit?

    n  Now we can return to this question.
    n  How well do the robot norms and copyright law get along?
    n  The way to answer is to see how copyright law responds when someone violates the norm.