Simple tech-talk

李辉荣

2011-07-14

cd ppt-session
ls -l

drwxr-xr-x 2 root root 4096 Web Crawler
drwxr-xr-x 2 root root 4096 Linux
drwxr-xr-x 2 root root 4096 Java
drwxr-xr-x 2 root root 4096 Other

cd "Web Crawler"

What on earth is this thing?

How search engines accumulate their information: Google, Baidu, Bing…

Googlebot, Baiduspider, bingbot, Yahoo! Slurp

Politeness

Robots.txt: http://zh.wikipedia.org/wiki/Robots.txt
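A minimal sketch of the politeness check, assuming a deliberately simplified reading of robots.txt: only the `User-agent: *` group is considered, only `Disallow` prefixes are honored, and wildcards, `Allow` lines, and groups that list several user agents are ignored. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsSketch {
    /** Collects the Disallow prefixes listed under "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash).trim(); // drop trailing comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed when it starts with none of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would fetch `/robots.txt` once per host, cache the prefix list, and call `isAllowed` before queuing each URL.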

Breadth-first search (BFS)

Depth-first search (DFS)


A few servers, everyone swarming in at once: the server goes down, or answers 403

A few servers, visited in shuffled order
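One way to read "shuffled order" is to rotate requests across hosts so that no single server takes a burst. The sketch below is an assumption about how that could look, not the talk's implementation; `HostInterleaver` is a hypothetical name.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HostInterleaver {
    /** Reorders URLs so consecutive requests rotate across hosts where possible. */
    static List<String> interleaveByHost(List<String> urls) {
        // One queue per host, kept in first-seen order.
        Map<String, Deque<String>> perHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            perHost.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        // Take one URL from each host per round until every queue is empty.
        List<String> ordered = new ArrayList<>();
        while (ordered.size() < urls.size()) {
            for (Deque<String> queue : perHost.values()) {
                if (!queue.isEmpty()) {
                    ordered.add(queue.poll());
                }
            }
        }
        return ordered;
    }
}
```

When one host has far more URLs than the others, its requests still end up back to back at the tail, so a real crawler would probably add a per-host delay as well.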

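Typical implementation

FIFO: the frontier keeps growing

LIFO. Drawback: it can dive in and never climb back out. Advantage: efficient when traversing small sites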
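The FIFO/LIFO difference comes down to which end of the frontier the next URL is taken from. A small sketch over an in-memory link graph, where the six-page site and all names are made up for illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FrontierSketch {
    /** Visits pages reachable from start; fifo=true is breadth-first, false is depth-first style. */
    static List<String> crawlOrder(Map<String, List<String>> links, String start, boolean fifo) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty()) {
            // FIFO takes the oldest URL, LIFO takes the newest one.
            String page = fifo ? frontier.pollFirst() : frontier.pollLast();
            visited.add(page);
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (seen.add(next)) { // add() returns false for URLs already queued
                    frontier.addLast(next);
                }
            }
        }
        return visited;
    }

    /** Hypothetical six-page site used only for this illustration. */
    static Map<String, List<String>> sampleSite() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("/", Arrays.asList("/a", "/b"));
        links.put("/a", Arrays.asList("/a1", "/a2"));
        links.put("/b", Arrays.asList("/b1"));
        return links;
    }
}
```

On the sample site, FIFO visits `/, /a, /b, /a1, /a2, /b1` (level by level), while LIFO visits `/, /b, /b1, /a, /a2, /a1` (one branch to the bottom before the next).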

ls -l

cd Linux


Why learning to use Linux is necessary

Distributions: Debian, RPM, …

Servers mostly run Linux (Red Hat, CentOS, Ubuntu Server, openSUSE)

Recommended for personal installs: Ubuntu (RTX is junk!!)

nano (^O, ^X)

gedit (GNOME)

The god of editors: Vi. The other one is not covered, because I can't use it...

English: http://marius.wirelessisfun.com/2010/tutorial-vi-vim/

Chinese: http://blog.webshuo.com/2011/02/23/549/

The Vi version on the current server


Input/output redirection, pipes

grep
tail -f -n
ps -ef / -aux
top
[ ] / [[ ]] / for / if / while / do / done / then…
sed / awk
perl -nle

Documentation: http://10.0.93.16/abs-3.9.1/


Based on Debian, Ubuntu aims to give ordinary users an up-to-date yet fairly stable operating system built mainly from free software.

Ubuntu Desktop

Ubuntu Server

Installing software

Install from a .deb package: sudo dpkg -i vim.deb

Install with apt: sudo apt-get install vim

ls -l

cd Java


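Coding style, programming habits, comments, markers

Programming to abstractions

Using frameworks

Logging

Naming variables and methods

Formatting: Eclipse, Ctrl+Shift+F?

The code is the comment! Extra comments are fine, but they must be concise

Chinese? English?

Markers: TODO, FIXME

Threads: setName(String threadName)

close()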
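A short sketch of the last two habits. It assumes that `close()` on the slide refers to releasing resources such as streams and readers; the class and method names are hypothetical, and try-with-resources needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ThreadNamingSketch {
    /** Runs a task on a named thread and returns the name the task observed. */
    static String runNamed(String threadName) {
        final String[] seen = new String[1];
        Thread worker = new Thread(() -> seen[0] = Thread.currentThread().getName());
        worker.setName(threadName); // the name shows up in logs and thread dumps
        worker.start();
        try {
            worker.join(); // wait so the result is visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    /** Reads the first line; try-with-resources calls close() even when reading fails. */
    static String firstLine(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A descriptive name such as `crawler-fetch-1` makes a thread dump or a `%t` log column readable, where the default `Thread-7` does not.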

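My personal understanding (not guaranteed to be entirely accurate)

Interfaces are for type conversion: treating an object as a given type

Abstract classes are for reducing duplicated code and sharing common base functionality

Prefer interfaces

When extracting an abstract class, try to consider the Adapter pattern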
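The sketch below shows one possible reading of these points, assuming "Adapter" means the JDK-style convenience class (in the spirit of `java.awt.event.MouseAdapter`) that implements an interface with empty methods. `PageHandler`, `PageHandlerAdapter`, and `UrlCollector` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** The type that callers depend on. */
interface PageHandler {
    void onPage(String url, String html);
    void onError(String url, Exception cause);
}

/** Adapter-style base class: no-op defaults, so subclasses override only what they need. */
abstract class PageHandlerAdapter implements PageHandler {
    @Override public void onPage(String url, String html) { }
    @Override public void onError(String url, Exception cause) { }
}

/** Concrete handler that only cares about successfully fetched pages. */
class UrlCollector extends PageHandlerAdapter {
    final List<String> urls = new ArrayList<>();

    @Override public void onPage(String url, String html) {
        urls.add(url);
    }
}
```

Callers hold a `PageHandler`, so a handler that does not fit the abstract class can still implement the interface directly.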

Java JDK class library

Collections framework
java.util.concurrent framework
The stream (I/O) classes

Using open-source frameworks:

Spring, Struts2
iBatis, Hibernate
Log4j, Jetty, HtmlCleaner, Quartz

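× System.out.println(msg)
√ Logger.info(msg)

Logs let you review the program's run history.
Chinese? English?
log4j recommended: full-featured, well documented
Single-line format, visual blocks
Line numbers are a must
In a multithreaded environment, the %t option is a must

log4j.logger.com.xxx=debug,stdout,logfile
log4j.additivity.com.xxx=false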
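The two lines above refer to appenders named stdout and logfile without defining them. A possible log4j 1.x configuration that fills them in might look like the fragment below; the file path, the rollover pattern, and the root logger level are assumptions, and `com.xxx` stays a placeholder as on the slide.

```properties
# Sketch of a log4j 1.x configuration; path and levels are placeholders.
log4j.rootLogger=info, stdout

# Console appender: one event per line, with thread name (%t) and line number (%L).
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c:%L - %m%n

# File appender that rolls over daily.
log4j.appender.logfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.logfile.File=logs/app.log
log4j.appender.logfile.DatePattern='.'yyyy-MM-dd
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c:%L - %m%n

# Own package at debug level; additivity=false keeps events from also
# reaching the root logger's appenders, which would print them twice.
log4j.logger.com.xxx=debug, stdout, logfile
log4j.additivity.com.xxx=false
```

Note that `%L` requires log4j to compute caller location, which is comparatively slow, so it may be worth dropping on very hot code paths.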

ls -l

cd Other


Direction and speed

The essence of the problem

Laziness is not a flaw

A little HTTP

Getting information

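SB & NB

Example: Link Extract and Page Analyse in the crawler

Abstract and concrete

Task decomposition: one big task, many small tasks

There is more than one way to solve a task

At every moment, be clear about what the big task is; that is the essence

Example: downloading a web page, HtmlCleaner

The real problem: a correct TagNode object

And not: a method that returns the correct Charset

Of course, the job still has to be done well

Being too lazy to learn is the real flaw

The precondition for doing half the work for twice the result: a bit of a learning curve

Examples

One-click deployment script

Using Vi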

HTTP

URL components: http://www.google.com:80/search?q=http+url#tag

$protocol//$host$pathname$search$hash
$host => $hostname:$port
search: the query string

You can inspect these with JavaScript: location.xxx
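The same breakdown can be sketched in Java with `java.net.URI`. The map keys mirror the `location` property names from the slide; the `UrlParts` class itself is hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UrlParts {
    /** Splits a URL into the pieces named on the slide. */
    static Map<String, String> split(String url) {
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", uri.getScheme() + ":");
        parts.put("hostname", uri.getHost());
        parts.put("port", String.valueOf(uri.getPort())); // -1 when no port is given
        parts.put("pathname", uri.getPath());
        // Keep the leading "?" and "#" so the pieces concatenate back as on the slide.
        parts.put("search", uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery());
        parts.put("hash", uri.getRawFragment() == null ? "" : "#" + uri.getRawFragment());
        return parts;
    }
}
```

For the slide's URL this yields `http:`, `www.google.com`, `80`, `/search`, `?q=http+url`, and `#tag`.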

Status codes

Choosing books

Open-source projects

API documentation

https://www.google.com/reader/


Thanks to everyone who stuck it out to the end

You've been great~

Blog: http://blog.blacklee.net/
Twitter: http://twitter.com/liltos
