GooglePrivacy

Ahsan Siddiqui
ahsanas@seas.upenn.edu
Faculty Advisor: Jonathan M. Smith

Abstract

This project is aimed at addressing the problem of privacy on the internet. In particular, it tries to solve the problem that has been created by Google's dominance of the search engine field. Whenever a user types a search query into Google's search engine, Google stores all the data it receives. This data includes the cookie ID, internet IP address, time and date of the search, the search terms, and the browser configuration. Because Google provides numerous services, users enter sensitive data such as their phone contacts, street addresses, news preferences, dynamic web content, images, videos, blogs, library preferences, job related activities etc. into Google (Kulathuramaiyer & Balke, 2006, p. 1733). Since Google retains all this data indefinitely, there is a large risk to an individual's privacy if this data gets into the wrong hands.

The system described in this document substantially reduces the privacy risk faced by an individual when using the Google search engine. The GooglePrivacy system allows users to utilize the full power of the Google search engine without having to give up any personal data. This is achieved primarily by routing user queries through a proxy server specially configured to remove all sensitive user data. Every query entered into the GooglePrivacy web page is sent to the GooglePrivacy server instead of directly to the Google server. In turn, the GooglePrivacy server removes all sensitive user information and forwards the query to the Google server using its own IP address and settings. The search results page is then filtered to remove all active content placed by Google and returned to the user. Additionally, all links on the results page are modified to ensure that all further queries are routed through the GooglePrivacy server instead of through the Google server. As a result, Google records the GooglePrivacy server's actions as opposed to the user's actions.

Since the GooglePrivacy server is shared by multiple users making a wide range of queries, GooglePrivacy prevents Google from mining data on a single individual. The random query feature adds additional complexity to the task of mining individual data. Thus GooglePrivacy achieves the goal of reducing the privacy risk that Google poses to its users.


Related Work

Dangers of Search Engine Monopoly

With the emergence of Google as a dominant search engine-based company, numerous privacy issues have been created, which Kulathuramaiyer and Balke (2006) address in their paper, "Restricting the View and Connecting the Dots - Dangers of Web Search Engine Monopoly." It is estimated that Google accounts for 49% of the billions of daily web queries (Kulathuramaiyer & Balke, 2006), and thus Google has enormous power to influence the public and private lives of its users. Google's goal of creating the perfect search engine can only be realized if the company collects enormous amounts of data about its users, and so Google will continue to extract and store as much information as it can. Google has already amassed a wealth of data about users including their "phone contacts, street addresses, news feeds, dynamic web content, video and audio streams, speech, library collections, artifacts etc" from its search engine (Kulathuramaiyer & Balke, 2006, p. 1733). Additionally, the purchase of large user communities, such as YouTube and Orkut, allows Google to profile users within their community. The Google Earth platform allows Google to provide "targeted context-specific information based on physical locations" (Kulathuramaiyer & Balke, 2006, p. 1734). This enormous amount of data allows Google to create user profiles that are beyond the imagination of most people. Since Google is a public limited company, it is responsible only to its shareholders. Kulathuramaiyer and Balke (2006) address the threat of Google exploiting the data it has collected on users in order to increase revenues and satisfy shareholders. Even though these events are unlikely, the threat does exist, and there are no formal laws to prevent Google from profiting from user data mined from its products.

In the long run, Kulathuramaiyer and Balke predict that it will become necessary to have internationally accepted laws to "both curtail virtual expansion and to characterize the scope of data mining" (Kulathuramaiyer & Balke, 2006, p. 1738). However, to address this problem in the short run, GooglePrivacy and similar systems are necessary to protect search engine users.

Private Web Search: Existing Methods

In order to address some of the privacy problems posed by Kulathuramaiyer and Balke (2006), numerous applications have been introduced. Some of these applications, such as iPrivacy, Tor, Anonymizer and TrackMeNot, are discussed later in this section. However, the Private Web Search (PWS) application has, in my opinion, come closest to solving the problem of privacy on search engines. In their paper, Saint-Jean, Johnson, Boneh and Feigenbaum (2007) discuss the privacy problems with a large search engine, and how their product addresses these problems. "PWS is designed as a Firefox plugin that functions as an HTTP proxy and as a client for the Tor anonymity network" (Saint-Jean et al., 2007, p. 84). The application minimizes data sent to search engines by filtering requests and responses exchanged with the search engine, and by routing the searches through an anonymity network. Thus, PWS prevents search engines from gathering information about users such as their location,


ISP, browser information, specific time of query and the query strings. All this data is included in search requests sent to search engines in the form of IP addresses, HTTP headers and cookies. The team's main objective is to "work towards reducing the linkability of queries to each other as well as the linkability of queries to the users" (Saint-Jean et al., 2007, p. 87). Since GooglePrivacy has similar objectives, the two applications are similar in some respects, but completely different in others. Both systems have a similar HTTP filter that removes user information and active content from HTTP requests. However, the systems differ in their use of proxies: PWS utilizes the Tor proxy network described above, whereas GooglePrivacy uses its own proxy server. Using Tor has the following disadvantages:

• Additional latency from going through multiple servers
• Risk of an adversary controlling a Tor server and mining all the data going through it

GooglePrivacy avoids these problems by using its own proxy server. In addition, GooglePrivacy utilizes random search queries to prevent search engines from associating searches with users.

iPrivacy

The iPrivacy system is another system designed to tackle internet privacy. Like GooglePrivacy, iPrivacy uses its own proxy server to intermediate communication with other websites. The company's goal is to provide "privacy as opposed to anonymity" (Smith, 2001, p. 2). iPrivacy uses secure servers to route sensitive user information such as home addresses, financial information, credit cards, IP addresses and cookies. The system edits HTTP headers to remove all user information and to pass websites as little data as possible. It also allows for the removal of cookies (which contain sensitive user data) and "eliminates active content such as Javascript and Java" (Smith, 2001, p. 2). The main purpose of the iPrivacy system is to secure transactions on the internet by providing websites with as little user data as possible. This is the factor that differentiates iPrivacy from GooglePrivacy: GooglePrivacy is designed to provide users privacy from search engines, as opposed to securing financial transactions. Thus GooglePrivacy targets search engines whereas iPrivacy targets online exchange sites. While the two systems use similar methodologies (proxy servers and HTTP filters) to achieve these end goals, GooglePrivacy is customized towards preventing search engines from gaining user data. GooglePrivacy's random search query feature further enhances the search engine user's security and, in my view, does a better job of preventing adversaries from mining user data than any other system.


Anonymizer

The Anonymizer system is designed to remove all electronic traces left behind by the user. The system routes all internet transactions via a secured server, so no trace of the user's data is left on any website. While this system solves the problem of data mining, it also removes a number of the features that make the internet convenient to users. This product is different from the GooglePrivacy system because it targets users who prefer total anonymity on the internet. GooglePrivacy targets users who do not want to remain totally anonymous, but who do not want their data to be mined by search engines.

Tor

The Tor system also routes user data through servers. Unlike the services mentioned above, however, Tor uses multiple servers known as onion routers to move data around and protect against "eavesdroppers." The differentiating factor between Tor and GooglePrivacy is that Tor is targeted mainly towards peer-to-peer (P2P) interactions on the internet, such as instant messaging and SSH, whereas GooglePrivacy is targeted and optimized towards preventing search engine data mining. Tor also has a higher response time due to the multiple routings, and so is not ideal for users who are simply trying to avoid data mining by search engines. Additionally, the servers used by Tor are set up by volunteers, so the possibility exists that one of these "volunteers" is an adversary mining the data.

TrackMeNot

This product is a lightweight browser extension that aims to prevent search engines from profiling users, so it has the same target audience as GooglePrivacy. TrackMeNot sends random queries to search engines to mask the data that the search engines collect on the user. The problem with this approach is that a savvy programmer can create a program to filter out such random searches using machine learning or some other technique. Also, this application adds overhead on the CPU, as it is constantly running in the background.

So while there are numerous products that try to address the problem of internet privacy, none of them can completely satisfy the user who wants the convenience of internet functions but does not want their privacy invaded by search engines. GooglePrivacy is a product that solves this problem.


Technical Approach

The GooglePrivacy server is implemented as a Java Servlet, running on an Apache Tomcat server. Java and Tomcat were chosen over PHP and the Apache HTTP server due to the following benefits:

• Convenience: Knowledge of Java meant that it was not necessary to learn a new programming language and familiarity with the language made it easier to learn servlet programming.

• Efficiency: In the Apache HTTP server, a new heavyweight process is started for each HTTP request, whereas a Servlet responds to each HTTP request with a lightweight Java thread (Hunter & Crawford, 1998, p. 11).

• Power: Java’s extensive libraries allow for complex operations to be performed very easily. Operations such as maintaining information between different requests, session tracking and caching of previous computations are simplified to a large extent. (Hunter & Crawford, 1998, p. 11)

On a basic level, the GooglePrivacy system works as an intermediary between the user and the search engine. There are three major sections of the system:

• Client Interface
• Random Query Generator
• HTTP Parser

These sections are shown in the block diagram below:

[Block diagram: Users → Client Interface → Random Query Generator → HTTP Parser (containing the URL Re-Writer and Active Content Remover) → Google Server]


Client Interface

The client interface of the GooglePrivacy system is the interface the user interacts with. A view of the client side is displayed below:

[Screenshot: the GooglePrivacy client interface]

Once a user enters a query string and clicks "GooglePrivacy Search", the client issues a GET() call to the GooglePrivacy server. This GET() call includes all the user data that Google would normally receive in the form of the GET() call itself, the HTTP headers and cookies. However, the user's data is now passed to the GooglePrivacy server, which masks the user's search query, retrieves the results from Google and filters the results. These operations are further discussed in the HTTP Parser section. The client interface is also responsible for formatting and sending the GET() calls to Google (with the user search query), and then passing the resulting stream to the HTTP Parser. The client interface opens a socket with the Google Server, and creates a stream with the Google Server to send the GET() request. Once the GET() request has been sent, the Google Server returns a stream containing the HTML for the search results page. This HTML is then passed on to the HTTP Parser.
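As an illustration, the query-forwarding step must URL-encode the user's search term before it can appear in the GET() call. A minimal sketch of that encoding is shown below; the class and method names are hypothetical, not taken from the actual GooglePrivacy source:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch: builds the search path the proxy would send to Google for a user query. */
public class QueryPath {

    /** URL-encode the user's query and embed it in Google's /search path. */
    public static String searchPath(String userQuery) {
        try {
            // Encode spaces and special characters so the query is a valid URL component.
            String encoded = URLEncoder.encode(userQuery, "UTF-8");
            return "/search?q=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchPath("Senior Design")); // /search?q=Senior+Design
    }
}
```

The resulting path would then be written to the socket opened with the Google Server, as described above.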

Random Query Generator

Random queries are utilized by GooglePrivacy to hide user queries. The primary purpose of sending random queries to the Google Server is to mask user searches from other users. If a user checks the search history of the IP address associated with the GooglePrivacy server, that user should not be able to decipher the search queries of other users.


A secondary use of random queries is to prevent Google from being able to identify users based on search patterns. By interleaving user queries with random queries, it is more difficult to extract a single user's queries. The random search query generator works by scraping a random query off a list of the 10,000 most common web searches, located at http://voytech.com/keywords/keywords_list.htm. Below are the results of 9 executions of the Random Query Generator:

1. Red
2. Prom Dresses
3. Multimedia
4. Defloration
5. Centrelink
6. Sheet Music
7. Replica Watches
8. Virusscan
9. Placebo

Random queries are issued when a GET() call is sent by the client interface to the Google Server. This masks the actual GET() request sent to the Google Server by GooglePrivacy. The order of the queries sent is randomized, to make detecting a pattern even more difficult. The order of queries sent to the Google Server from GooglePrivacy is:

• A random number (0 to 2) of random queries
• The actual user query
• A random number (0 to 2) of random queries

Random numbers between 0 and 2 were chosen because the performance of GooglePrivacy becomes noticeably slower if more than 5 queries are sent to Google for every 1 query sent to GooglePrivacy. Three instances of sending the search term "Senior Design" via the GooglePrivacy system resulted in the following queries being sent to the Google Server:

Instance 1: [Random Query] Cheap Cigarettes, [User Query] Senior Design, [Random Query] Pet Food
Instance 2: [User Query] Senior Design, [Random Query] Beautiful Women
Instance 3: [Random Query] Free Ecards, [Random Query] Civilization, [User Query] Senior Design
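The interleaving scheme above can be sketched as follows. This is an illustrative reconstruction, not the actual GooglePrivacy code: the class name, the decoy pool (standing in for the scraped list of common searches), and the method signatures are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch: interleaves a user query with 0-2 random decoy queries on each side. */
public class QueryMixer {

    // Stand-in for the scraped list of the 10,000 most common web searches.
    private static final String[] POOL = {
        "Red", "Prom Dresses", "Multimedia", "Sheet Music", "Placebo"
    };

    /** Returns the queries to send to Google, in order, for one user search. */
    public static List<String> mix(String userQuery, Random rng) {
        List<String> out = new ArrayList<>();
        addDecoys(out, rng.nextInt(3), rng);  // 0, 1, or 2 decoys before
        out.add(userQuery);                   // the real query
        addDecoys(out, rng.nextInt(3), rng);  // 0, 1, or 2 decoys after
        return out;
    }

    private static void addDecoys(List<String> out, int count, Random rng) {
        for (int i = 0; i < count; i++) {
            out.add(POOL[rng.nextInt(POOL.length)]);
        }
    }

    public static void main(String[] args) {
        System.out.println(mix("Senior Design", new Random()));
    }
}
```

Because a fresh random count is drawn for each side of the user query, the real query's position in the batch varies from search to search.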


Thus, there is no pattern to be detected by studying the order of random queries, as a random number generator determines the order every time. As the number of users of the GooglePrivacy system increases, the search engine's ability to extract a single user's queries out of a large pool of queries will diminish. Thus, if the GooglePrivacy server reaches a certain level of adoption, the Random Query Generator will no longer be required. However, it is uncertain what that level of adoption is.

HTTP Parser/Filter

The HTTP Parser is the most essential part of the GooglePrivacy system, and is also the part where most of the system's work is done. The HTTP Parser receives an input stream from the Google Server containing the HTML page of search results, and then performs two functions on the HTML page before returning it to the user:

• Active Content Removal: Any content/links that send information back to the Google Server are removed.

• URL rewriting: Links that lead to further action by the Google Server must be modified to go via the GooglePrivacy server instead of through the Google Server.
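The URL-rewriting pass can be sketched with Java's regular expression classes. The pattern and the proxy address below are illustrative approximations, not the hand-built patterns used by the actual GooglePrivacy system:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: rewrites Google result-page links to route through a proxy server. */
public class UrlRewriter {

    // Matches href attributes that point at a Google search URL.
    private static final Pattern GOOGLE_HREF =
        Pattern.compile("href=\"http://www\\.google\\.com(/search[^\"]*)\"");

    /** Replace each Google search link with the proxy's equivalent link. */
    public static String rewrite(String html, String proxyBase) {
        Matcher m = GOOGLE_HREF.matcher(html);
        // $1 keeps the original path and query string, e.g. /search?q=foo
        return m.replaceAll("href=\"" + proxyBase + "$1\"");
    }

    public static void main(String[] args) {
        String in = "<a href=\"http://www.google.com/search?q=foo\">next</a>";
        System.out.println(rewrite(in, "http://googleprivacy.example.edu"));
    }
}
```

A "next page" link rewritten this way sends the follow-up query to the proxy rather than directly to Google.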

To achieve this functionality, the Java classes Pattern and Matcher (from java.util.regex) are used. HTML patterns are created after examining the Google search results page, and the input stream passed into the HTTP Parser is checked for these HTML patterns. If a match is found, the pattern is either removed by the Active Content Remover, or the URL in the link is modified by the URL Rewriter to go via the GooglePrivacy server instead of through the Google Server. By using the HTTP Parser, search results pages are filtered as shown below:


[Figure: annotated Google search results page; the numbered callouts correspond to the items below]

1. GooglePrivacy only supports web searches, so links that lead to other Google services are removed by the Active Content Remover.
2. Information dealing with the login of the user (which makes it easier for Google to keep track of users) is removed by the Active Content Remover.
3. The image of the Google logo, which is also a link back to the Google home page, is removed by the Active Content Remover and replaced by the GooglePrivacy logo.
4. The input bar sends all searches to the Google Server. The link in the input bar is changed by the URL Rewriter to send the query to the GooglePrivacy Server instead of the Google Server.
5. The "Advanced Search" and "Preferences" features are not supported by GooglePrivacy, so those links are removed by the Active Content Remover.
6. See point 1.
7. Definitions of words are routed to ask.com, but Google records the query before it is sent, so links to definitions of words are removed by the Active Content Remover.
8. Sponsored links do not take you straight to the URL indicated, but are first routed to a Google Server that records usage for AdSense purposes. Thus the link is completely rewritten by the URL Rewriter.
9. The "Cached", "Similar pages" and "Note this" features are not supported by GooglePrivacy, so those links are removed from every search result by the Active Content Remover.


10. Corrections for misspellings are sent to the Google Server, so these queries are rewritten by the URL Rewriter to go via the GooglePrivacy server.
11. The links to subsequent search pages also go through the Google Server instead of the GooglePrivacy server, and so must be rewritten by the URL Rewriter to go via the GooglePrivacy server.
12. See point 4.
13. Additional Google features that are not supported by GooglePrivacy are removed by the Active Content Remover.
14. Information about Google, not relevant to GooglePrivacy, is removed by the Active Content Remover.

The resulting filtered page is displayed below. All the content and links described above that lead back to the Google Server have been removed by the Active Content Remover, and all URLs that need to be sent back to the Google Server have been altered so that they go via the GooglePrivacy Server.

[Screenshot: the filtered GooglePrivacy results page]
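The removal pass can be sketched in the same Pattern/Matcher style. The patterns below are illustrative stand-ins for the hand-built patterns described above, covering a script block and a link to an embedded Google service:

```java
import java.util.regex.Pattern;

/** Sketch: strips active content (scripts, tracked service links) from result HTML. */
public class ActiveContentRemover {

    // Any inline script could report back to Google, so remove the whole block.
    private static final Pattern SCRIPT_BLOCK =
        Pattern.compile("(?s)<script.*?</script>");

    // Links into other Google services (Maps, Images, News) are not supported.
    private static final Pattern GOOGLE_SERVICE_LINK =
        Pattern.compile("<a href=\"http://(?:maps|images|news)\\.google\\.com[^\"]*\"[^>]*>.*?</a>");

    /** Remove every matching block from the page. */
    public static String strip(String html) {
        String out = SCRIPT_BLOCK.matcher(html).replaceAll("");
        return GOOGLE_SERVICE_LINK.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        String in = "<script>track();</script><p>result</p>";
        System.out.println(strip(in)); // <p>result</p>
    }
}
```

In the real system each numbered item above would correspond to its own pattern, applied in sequence to the results page.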


Result

As a result of the measures taken by the GooglePrivacy system, Google receives no information about the user who actually made the query. Thus Google does not have access to any of the information obtained from the Privacy Analyzer at www.privacy.net (Privacy.Net: The Consumer Information Organization, 2007), which shows what data a website can extract from a GET() call. The following table shows what data was extracted from my web browser by simply sending a GET() call:


Browser Type and Version
Browser: Firefox; Full Version: 2.0.0.13; Gecko: True; GeckoBuildDate: 20080311; Crawler: False

Browser Security
Session Cookies: Accepted; Persistent Cookies: Accepted; JavaScriptEnabled: True; VBScriptEnabled: False; JavaEnabled: True; ActiveXEnabled: False; SSL: True; SSLActive: False; SSLKeySize: 128; SSLEnabled: True; Firewall: False; OpenPorts: xxxxxxx; PopupsBlocked: True; ImagesEnabled: True; HighSecurity: False

Connection Details
Broadband: True; ConnectionType: ; Firewall: False; Proxy: False; CompressGZip: True; AOL: False; MSN: False

Scripting Capabilities
ActiveXControls: False; ActiveXEnabled: False; JavaScript: True; JavaScriptEnabled: True; JavaScriptVer: 1.5; JavaScriptBuild: ; VBScript: False; VBScriptEnabled: False; VBScriptBuild: ; XML: True; MSXML: 0; XMLHttpRequest: True; DHTML: True; FileUpload: Yes

Domain Information
Domain Name: UPENN.EDU
Registrant: University of Pennsylvania, ISC Networking & Telecommunications, 3401 Walnut Street Suite 221A, Philadelphia, PA 19104-6228, UNITED STATES
Administrative Contact: ISC N&T Billing Business Department, University of Pennsylvania, ISC Networking & Telecommunications, 3401 Walnut St. Suite 221a, Philadelphia, PA 19104-6228, UNITED STATES, (215) 898-2883, [email protected]
Technical Contact: Network Operations Center (NOC), University of Pennsylvania, ISC Networking & Telecommunications, 3401 Walnut Street Suite 221a, Philadelphia, PA 19104-6228, UNITED STATES, (215) 573-9631, [email protected]

System Details
Platform: WinXP; Win16: False; WinInstallerMinVer: 2

Plug-in Information
Flash: 9.0 r47; QuickTime: Installed (version 7.4.1); Acrobat: ; RealPlayer: 6.0.11.3006; MediaPlayer: Installed but version not detected; Flip4Mac: installed

Java Information
JavaApplets: True; JavaEnabled: True

Wireless Device Information
PDA: False; WAP: False; HDML: False

Locale Information
Country: US; Language: English; User Language: en-us; System Language: ; Time Zone Difference: 0; Browser Date and Time: Monday, April 14, 2008 11:24:27 PM; Browser Date and Time (ms): 1208229867453


Name Servers: NOC3.DCCS.UPENN.EDU (128.91.251.158, 2001:468:1802:103::805b:fb9e), NOC2.DCCS.UPENN.EDU (128.91.254.1), DNS1.UDEL.EDU, DNS2.UDEL.EDU
Domain record activated: 02-Jun-1986; last updated: 29-Jun-2007; expires: 31-Jul-2008

Thus, a user of the GooglePrivacy system can be assured that all of this data, which includes preferences, browser settings and, to some extent, personal location, does not get into the wrong hands. Additionally, the user can be assured that all searches performed on the search engine will remain private and will not be used against the user in any way.

Technical Challenges

There were a number of technical challenges associated with implementing this system. Some of the biggest are discussed below:

• Choosing the Correct HTTP Server: I started out coding in PHP on the Apache HTTP server, but after learning the system and building a preliminary prototype, discovered that it was insufficient for my needs. I realized that Java servlets were more suitable for this system and that I was much more comfortable using them, so I had to scrap my first prototype. Scrapping an early prototype of the GooglePrivacy server was a difficult decision, because it meant learning an entirely new system and redoing a lot of my earlier work. However, by building the first prototype I learned a lot about web systems and web systems design. The switch to Java Servlets caused a considerable delay, but by making it, I inadvertently followed Fred Brooks' advice to "plan to throw one away; you will, anyhow" (Brooks, 1975, p. 115). I believe that I learned a lot from my mistakes, and the experience of building a prototype allowed me to create a better and more robust system on the second attempt with Tomcat and Java Servlets.

• Sending GET() Calls to Google from a Servlet: To prevent flooding of its servers, Google does not allow other servers/servlets to make queries to its search engine. My original plan when building the GooglePrivacy system was to construct a URL using the user's search term (which was passed in from the Client Interface), and then use the URL library in Java to request the InputStream from the Google Server. The InputStream would contain the HTML of the results page. However, Google does not allow the Java URL library to request an InputStream, so I had to find another approach. After considerable research I decided to open a socket, establish a connection with google.com, and pass a series of GET() calls specially formatted to match a GET() call sent from a Mozilla Firefox browser. This adds additional security to the system, as it does not send a multitude of unnecessary information to Google, and lets GooglePrivacy have access to Google's search pages. However, this unexpected challenge was a setback, and the time taken to solve this problem meant that other functionality of the system could not be achieved.
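Formatting a request to look like one from Firefox amounts to writing the raw HTTP text by hand. The sketch below shows the general shape; the exact header values GooglePrivacy sent are not given in the report, so the Firefox 2 header strings here are illustrative assumptions:

```java
/** Sketch: formats a GET request resembling one sent by a Firefox 2 browser. */
public class BrowserStyleRequest {

    /**
     * Build the raw HTTP/1.1 request text for a search path. In the real
     * system this text would be written to a socket opened to google.com:80.
     */
    public static String build(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13\r\n"
             + "Accept: text/html\r\n"
             + "Connection: close\r\n"
             + "\r\n";  // blank line terminates the header section
    }

    public static void main(String[] args) {
        System.out.print(build("www.google.com", "/search?q=Senior+Design"));
    }
}
```

Because the request carries only these fixed headers, Google sees a generic browser rather than the identifying details a real user's browser would send.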

• HTTP Parser Corner Cases: The HTTP filter is the part of the GooglePrivacy system that does the most work. The basic functionality, which covers approximately 90% of the HTTP Parser's work, is described in the Technical Approach section. However, the HTTP Parser also has to deal with corner cases that arise with particular or rare search terms. The most notable corner cases discovered so far are listed below:

• Results from other Google services embedded within search results: For the search query "Quiznos 19104", the first search result is from Google Maps, which displays the locations of Quiznos restaurants in the area on a map. However, clicking the map takes you to the Google Maps home page and passes the user's information to Google. To avoid this, the map is removed by the Active Content Remover and the links are rewritten by the URL Rewriter. Google Maps is not the only Google service embedded in the search results; the HTTP Parser also accounts for Google Images, Books, Videos, News, Groups, Blogs and Music.

• YouTube results: A search for "Coldplay" leads to search results from YouTube. Clicking the image of the YouTube result or clicking the web link leads the user to YouTube via the Google Server, which records the transaction. Hence, the thumbnail of the video is removed and the link is modified to go via GooglePrivacy.

• Related Searches: Most searches in Google result in “Related Searches” being indicated at the bottom of the page to help the user search for other terms. The links to these searches are changed to go via the GooglePrivacy Server by the URL rewriter.

• Testing: To discover and fix the numerous corner cases, as well as to debug the system, I had my friends switch from using Google to using GooglePrivacy for web searches. They were instructed to email me any problems or anomalies they encountered. Because GooglePrivacy routes all searches via a proxy server, any relative links (such as /search) used by the Google HTML page would not work on GooglePrivacy, so it was easy to detect when a search was going directly to the Google server instead of via GooglePrivacy. This greatly aided debugging.


Conclusion

My goals at the onset of this project were to:

• Create a system that would provide substantial benefit, and that people would actually want to use
• Learn about web systems and web system programming
• Acquire the skill to read and understand documentation of systems and code libraries

I feel that I was able to achieve each of these goals, and that this experience has made me a much better programmer going forward. I have definitely learned more during the course of Senior Design than during the course of any other CSE class at Penn. This is partly due to the fact that there was no one there guiding my hand, and if I wanted something done I had to do it myself. The experience of plowing through badly written documentation, looking up what seems like every code library imaginable, asking friends for help to try and piece together a piece of a system is something every programmer has to go through, and I am glad I got a chance to do it during my time at Penn. In retrospect, one thing I wish that I had done differently over the course of this project was to ask for help earlier. For example, I had no previous experience in HTML programming, so I ran into a problem when trying to put the GooglePrivacy logo on the client interface. I spent a large amount of time looking on the internet and reading through HTML books to figure out how to put the logo up on my webpage, and after I was unable to do so, I went and asked a friend who has a lot of HTML experience. He was able to tell me how to do it in about 5 minutes. So, had I asked him for help earlier, I would have saved myself time, and could have used this time to add additional functionality to the system. Numerous delays caused by similar reasons resulted in me having to sacrifice functionality in order to produce a functional and robust system One important decision I made when I started to face delays in coding was to give up some functionality I had originally planned on including in the GooglePrivacy system, for a more robust system. I decided that having a robust system was more important to the user than having wide array of features. Thus I spent a good portion of time debugging the HTTP Parser and testing for corner cases, to make the GooglePrivacy system more robust. 
As a result, I had to give up the following functionality that I had originally planned to include in the GooglePrivacy system:

• SSL security between the user and the Client Interface
• Access to Cached and Similar Pages via the GooglePrivacy server
• Access to other Google services such as Maps and Images
• Using GooglePrivacy with other search engines such as Windows Live and Yahoo

However, I am very happy with my decision, and I feel that I have created a good, viable system that meets a real need in today's ever-changing world.


References

Privacy.Net: The Consumer Information Organization. Retrieved December 2, 2007, from http://privacy.net/

This website played an important part in my decision to choose this project. It displays all the information that a user sends to a server simply by making a GET request. Letting most websites have access to this information is not a problem for the user, but when this information is provided to a search engine like Google, it can be very dangerous. Google provides a multitude of services and hence collects enormous amounts of very sensitive data that can be linked back to the individual who sent it. This is a severe threat to privacy, and the GooglePrivacy system tries to address that threat.

Hansell, S. (2006, Aug 15). Advertisers Trace Paths Users Leave on Internet. The New York Times, Late ed., Sec. C, p. 1.

Hansell discusses how advertisers monitor user behavior on search engines and then target ads to those users. In the examples he describes, it is striking how accurately these systems profile a user based solely on their search queries. It makes you think about how much information you give up to a search engine just by typing in queries. With a company like Google, which provides users with mapping facilities, newspaper articles, online communities, email, and so on, an individual's entire profile, including very personal and sensitive information, can be assembled. If misused, this data can cause tremendous hardship for that individual. Hansell also reminds the reader that most of the large search engine companies are public companies, and so their primary responsibility is to their shareholders. These companies cannot be trusted with such sensitive and very valuable information while shareholders are demanding high returns. Hansell's article provides evidence that a system like GooglePrivacy is needed to address the problem created by Google, and as a recent article it is highly relevant today.
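The profiling Hansell describes is what GooglePrivacy's link rewriting (described earlier in this document) is meant to defeat: every link on a results page is modified so that follow-up clicks route through the GooglePrivacy server rather than reaching Google directly. A minimal sketch of that step follows; it assumes absolute links of the form `href="http://www.google.com/..."`, and the proxy base URL is hypothetical.

```java
// Illustrative sketch of GooglePrivacy's link-rewriting step: absolute
// Google links on a results page are pointed at the proxy, so subsequent
// queries are recorded against the proxy's identity, not the user's.
public class LinkRewriter {

    // Replace every absolute Google link with a proxy-relative one.
    // A real implementation would parse the HTML rather than use a
    // plain string substitution; this is a simplification.
    public static String rewrite(String html, String proxyBase) {
        return html.replace("href=\"http://www.google.com/",
                            "href=\"" + proxyBase + "/");
    }

    public static void main(String[] args) {
        String page =
            "<a href=\"http://www.google.com/search?q=test&start=10\">Next</a>";
        // The "Next" link now targets the (hypothetical) proxy host.
        System.out.println(rewrite(page, "http://googleprivacy.example.edu"));
    }
}
```

With every result, pagination, and spelling-suggestion link redirected this way, Google only ever observes the proxy's actions, which is the core privacy guarantee of the system.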


Kulathuramaiyer, N., & Balke, W. (2006). Restricting the View and Connecting the Dots – Dangers of a Web Search Engine Monopoly. Journal of Universal Computer Science, 12(12), 1731-1740.

In this article, the authors discuss how Google's position as a search engine monopolist has given the company the power to influence the private and public lives of its users. They point out that all of Google's mergers and acquisitions have positioned the company to extract enormous amounts of data and learn as much as it can about each user. The authors back up their arguments with sound logic and make the case that some kind of system is needed to prevent search engines like Google from harvesting user data. This article lends good support to Hansell's, because it approaches the dangers of search engines from a different angle: it points out the dangers of a search engine monopoly to the user, while Hansell addresses how data extracted from user queries can be dangerous. Together, the two articles show that users of search engines face a very real privacy problem, one that GooglePrivacy sets out to address.

Saint-Jean, F., Johnson, A., Boneh, D., & Feigenbaum, J. (2007). Private Web Search. In WPES '07: Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society (pp. 84-90). Alexandria, Virginia: ACM.

This article describes an attempt to create a system for search engine privacy similar to the one proposed in this document. The authors describe their system effectively and draw interesting parallels between the problem of search engine privacy and privacy-preserving data publishing. They also detail other attempts at the problem, such as Tor and TrackMeNot, which are discussed in this paper as well; this lends weight to the argument that the solutions provided by those applications are inferior to those provided by GooglePrivacy. The authors make many valid points about the need for a search engine privacy system, and I have used several of their arguments in this paper. Their work has strengthened my belief that a robust system can solve the problem of search engine privacy.

Smith, J. (2001). A Brief Introduction to iPrivacy. Presented at a conference in Paris, September 25, 2001.


This paper is the result of a presentation given by Professor Smith at a conference in Paris, while he held the position of Chief R&D Consultant at iPrivacy. It describes the iPrivacy system, how it works, and how it solves the problem at hand. At that time, the major issue with the internet was secure financial transactions, and iPrivacy was one solution to that problem. While iPrivacy addressed a different problem, its solution is very similar to the one implemented by GooglePrivacy, and there are many lessons to be learned from it. However, since this paper was written as a pitch to investors, its objectivity may be skewed and some of the online threats it describes exaggerated. It was also published in 2001, at the very start of Google's rise, and so may not be completely relevant to the issues at hand.

Hunter, J., & Crawford, W. (1998). Java Servlet Programming. Sebastopol, CA: O'Reilly & Associates, Inc.

This book provides an excellent background in Java servlet programming. Hunter and Crawford discuss the advantages of servlet programming over conventional CGI programming, then walk the reader through a basic servlet example before delving into the details. The book is very easy to read and follow, and is structured excellently to let a beginner get the hang of servlet programming. The authors also include a section devoted to proxy servers, which was immensely useful for this project.

Brooks Jr., F. P. (1975). The Mythical Man-Month. Reading, Massachusetts: Addison-Wesley Publishing Company, Inc.

The Mythical Man-Month is a book of short essays describing Brooks' experience managing the development of IBM OS/360. It highlights lessons that all software project managers should be aware of when developing large systems. I found many of these lessons to hold true when developing my own system, and reading this book made me think twice about scheduling and about allotting time for testing and debugging. Though the book was written in the 1970s, I feel its lessons are as valuable today as they were then, and I would highly recommend it to anyone associated with the computer science industry.