RISE CRADLE 崛起搖籃: Web Crawler Implementation and Source Code Sharing

On a whim, I lightly refactored a web crawler I wrote a while ago and open-sourced it on GitHub. Below is a brief overview of its design and how to use it.

GitHub: https://github.com/A-Ho/Crawler

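Overview:

Fetches large amounts of web page data to the client by following links from page to page (link by link)
Configurable crawl depth
Multi-threaded
Storage is abstracted behind the abstract class Cobweb.java, with three implementations backed by memory, a database, and disk. (I only learned later that this is the template method pattern.)
Supports downloading individual files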
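
To make the design above concrete, here is a minimal sketch of a depth-limited, link-by-link traversal whose storage is pluggable through a template method. It is not the project's actual Cobweb API: the names FrontierSketch, MemoryFrontierSketch, push, poll and markSeen are all hypothetical, and a prepared map of links stands in for real page fetching.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Hypothetical stand-in for the storage abstraction; not the real Cobweb class. */
abstract class FrontierSketch {
    /** A URL paired with its distance (in links) from the start page. */
    static final class Entry {
        final String url;
        final int depth;
        Entry(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Storage hooks: a memory, DB, or disk subclass would implement these.
    protected abstract void push(Entry e);
    protected abstract Entry poll();                 // returns null when empty
    protected abstract boolean markSeen(String url); // true only the first time

    /** Template method: the traversal is fixed, the storage is supplied by subclasses. */
    final List<String> crawl(String start, int maxDepth, Map<String, List<String>> linkGraph) {
        List<String> visited = new ArrayList<String>();
        if (markSeen(start)) push(new Entry(start, 0));
        for (Entry e = poll(); e != null; e = poll()) {
            visited.add(e.url);
            if (e.depth >= maxDepth) continue;       // do not follow links past the limit
            List<String> out = linkGraph.get(e.url); // assumption: replaces an HTTP fetch
            if (out == null) continue;
            for (String next : out) {
                if (markSeen(next)) push(new Entry(next, e.depth + 1));
            }
        }
        return visited;
    }
}

/** In-memory implementation of the storage hooks. */
class MemoryFrontierSketch extends FrontierSketch {
    private final Queue<Entry> queue = new ArrayDeque<Entry>();
    private final Set<String> seen = new HashSet<String>();
    @Override protected void push(Entry e) { queue.add(e); }
    @Override protected Entry poll() { return queue.poll(); }
    @Override protected boolean markSeen(String url) { return seen.add(url); }
}
```

A DB-backed or disk-backed variant would only need to override the three hooks; the traversal in crawl stays untouched.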

Environment:

Java version >= 1.5
MySQL

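Inputs: (same as the XML configuration file)

Start URL
Crawl depth
URL filters
Download directory
Number of threads
Automatic notification when the job finishes, etc.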

Available interfaces:

API
XML Config
Java Swing GUI

Development notes:

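Writing a simple crawler, or a small one that only fetches a few hundred pages, is not hard. Once the volume reaches tens or hundreds of thousands of pages, however, it can easily eat all available memory, and memory management suddenly becomes very important.
Setting objects that are no longer needed to null, or calling System.gc() to ask the JVM to reclaim memory (the call is only a hint), can help. So can offloading unpredictably large volumes of data to a database or to one of the distributed storage systems that are popular now. gcviewer.jar can be used to inspect memory usage and then adjust the code accordingly.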
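
As one possible way to act on this, the sketch below buffers fetched pages in memory and hands them to an external sink once the buffer grows too large or the heap fills up. It is only an illustration: HeapGuardSketch, Sink and both thresholds are hypothetical, and the sink stands in for whatever DB or disk writer is actually used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffer that spills to external storage before memory runs out. */
class HeapGuardSketch {
    /** Receives a batch of pages to persist (for example to a DB or to disk). */
    interface Sink { void write(List<String> batch); }

    private final List<String> buffer = new ArrayList<String>();
    private final int maxBuffered;   // flush after this many buffered pages
    private final double heapLimit;  // flush when used heap exceeds this fraction
    private final Sink sink;

    HeapGuardSketch(int maxBuffered, double heapLimit, Sink sink) {
        this.maxBuffered = maxBuffered;
        this.heapLimit = heapLimit;
        this.sink = sink;
    }

    /** Fraction of the maximum heap currently in use, between 0.0 and 1.0. */
    static double usedHeapFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    /** Buffers one page and spills the buffer if either threshold is reached. */
    void add(String page) {
        buffer.add(page);
        if (buffer.size() >= maxBuffered || usedHeapFraction() > heapLimit) {
            flush();
        }
    }

    /** Hands the buffered pages to the sink and releases them from memory. */
    void flush() {
        if (buffer.isEmpty()) return;
        sink.write(new ArrayList<String>(buffer));
        buffer.clear();
    }
}
```

Clearing the buffer after each write makes the pages unreachable, so the garbage collector can reclaim them without an explicit System.gc() call.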
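The current code separates the computation layer from the data layer, which makes it easy to swap in different storage backends.
Tools like this consume a lot of network bandwidth. Many sites add a robots.txt as a non-binding restriction, and some sites automatically block your IP based on your connection volume, so please set the sleep time sensibly to avoid being blocked. (Some master's theses in Taiwan also discuss this problem and suggest a sleep time > 30 sec.)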
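
A rough sketch of both courtesies follows: reading the Disallow rules that apply to all crawlers from a robots.txt body, and spacing out requests to the same host. PolitenessSketch and its methods are hypothetical and not part of the project. The parser is deliberately simplified (plain prefix matching; Allow lines and wildcards are ignored), and the caller is expected to fetch robots.txt and do the actual sleeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: simplified robots.txt rules plus a per-host delay. */
class PolitenessSketch {
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();
    private final long delayMillis;

    PolitenessSketch(long delayMillis) { this.delayMillis = delayMillis; }

    /** Collects Disallow prefixes from groups that include "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<String>();
        boolean inStarGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');                       // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inStarGroup = (lastWasAgent && inStarGroup) || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (key.equals("disallow") && inStarGroup && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    /** True if no collected Disallow prefix matches the path. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    /** Reserves the next request slot for a host; returns how long to sleep first. */
    synchronized long reserve(String host, long nowMillis) {
        Long next = nextAllowed.get(host);
        long start = (next == null || next < nowMillis) ? nowMillis : next;
        nextAllowed.put(host, start + delayMillis);
        return start - nowMillis;
    }
}
```

A worker thread would call reserve(host, System.currentTimeMillis()) and pass the result to Thread.sleep before each request. With a delay of 30000 ms this matches the 30-second suggestion above.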
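Writing this kind of program is very time-consuming: you often only discover where the code is badly written after it has fetched more than ten GB of data, and problems like these cannot be caught by unit tests alone, so just tracking down a problem can take up an entire evening.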

Refactoring references:

Practice in Java
Effective Java
Refactoring: Improving the Design of Existing Code

Usage example:

final String url = "http://www.bitauto.com/";
final String filePath = "E://download/";

// Memory-backed implementation of the Cobweb abstract class
Cobweb multiThreadSpider = new MemoryBasedCobweb();
multiThreadSpider.addDownloadFileType(FileType.jpg);
// Seed the queue with the start page
multiThreadSpider.addUnsearchQueue(new WebPage(WebPage.DEPTH_BASELINE, url));
multiThreadSpider.addDownloadRangeList("http://www.bitauto.com/");
multiThreadSpider.addDownloadRangeList("http://beijing.bitauto.com/");

final int threadCount = 8;
ThreadSpider[] phs = new ThreadSpider[threadCount];
for (int i = 0; i < phs.length; i++) {
    phs[i] = new ThreadSpider(url, filePath, multiThreadSpider, 8);
}

Thread[] searchs = new Thread[threadCount];
for (int i = 0; i < searchs.length; i++) {
    searchs[i] = new Thread(phs[i]);
    searchs[i].start();
    Thread.sleep(3000); // pause 3 s after each start (originally: wait for the first spider thread)
}
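
The loop above starts the spider threads but never waits for them to finish. If the caller needs to know when the crawl is done, one option is a fixed thread pool, assuming the spiders are plain Runnable objects (which passing them to new Thread(...) suggests). WorkerPoolSketch and runAll are hypothetical names used only for this sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that runs crawler workers and waits for them to finish. */
class WorkerPoolSketch {
    /**
     * Runs every worker on its own pool thread and blocks until all of them
     * finish or the timeout elapses. Returns true if all workers completed.
     * Expects at least one worker.
     */
    static boolean runAll(Runnable[] workers, long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newFixedThreadPool(workers.length);
        for (Runnable worker : workers) {
            pool.execute(worker);
        }
        pool.shutdown(); // accept no new tasks; already submitted ones keep running
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            pool.shutdownNow();
            return false;
        }
    }
}
```

With the variables from the example, the call would look like WorkerPoolSketch.runAll(phs, 1, TimeUnit.HOURS), where the one-hour limit is an arbitrary placeholder.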

XML Config: (more settings, in more detail)

<?xml version="1.0" encoding="utf-8" ?>
<spider>
  <enviroment>
    <proxy-host></proxy-host>
    <proxy-port></proxy-port>
  </enviroment>
  <job>
    <start-url></start-url>
    <minig-depth></minig-depth>
    <download-range>
    </download-range>
    <sleep-time>0</sleep-time>
    <disk-output>
      <file-path></file-path>
      <max-size></max-size>
    </disk-output>
    <database-output>
      <connection>
      </connection>
    </database-output>
    <finish-notify>
    </finish-notify>
  </job>
</spider>
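
For illustration, a config file like this could be read with the JDK's built-in DOM parser roughly as follows. This is a sketch, not the project's own loader: ConfigReaderSketch and its methods are hypothetical, and only the element names (including their spelling, such as minig-depth) are taken from the config above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the crawler's XML config. */
class ConfigReaderSketch {
    /** Parses the XML text into a DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid crawler config", e);
        }
    }

    /** Returns the trimmed text of the first element with this tag, or the fallback if missing or empty. */
    static String text(Document doc, String tag, String fallback) {
        NodeList nodes = doc.getElementsByTagName(tag);
        if (nodes.getLength() == 0) return fallback;
        String value = nodes.item(0).getTextContent().trim();
        return value.isEmpty() ? fallback : value;
    }
}
```

For example, Integer.parseInt(ConfigReaderSketch.text(doc, "sleep-time", "30")) would fall back to a 30-second delay whenever the element is left empty; the default value here is an assumption, not the project's.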


2013-05-20
