Harvester


Extract URLs from text, source code or search engine results. Produces a clean list of URLs.
 

Instructions

Paste or type text into the Harvester to extract the URLs it contains.
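
The extraction step can be approximated in a few lines of Python. This is only a sketch of the general idea, not the Harvester's actual code: the pattern `URL_PATTERN` and the function `extract_urls` are hypothetical names, and the pattern is an assumption that treats anything starting with `http://` or `https://` and running up to whitespace, a quote or an angle bracket as a URL.

```python
import re

# Assumed pattern (not the Harvester's own): an http(s) URL runs until
# whitespace, a quote character or an angle bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def extract_urls(text):
    """Return every URL-like substring of text, in order of appearance."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = 'See <a href="http://example.org/a">x</a> and https://example.com/b?c=1 now'
    for url in extract_urls(sample):
        print(url)
```

A pattern this simple will also pick up trailing punctuation when a URL ends a sentence, so the output may need a light clean-up.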

Tip: To extract the URLs (or embedded links) from a web page, view the page source, then copy and paste the source code into the Harvester.

Tip: To extract the URLs from the results of a Google query, view the source of the results page, then copy and paste the source code into the Harvester. To extract only the URLs of the results, choose the settings 'only return uniques' and 'Exclude URLs from Google and Youtube'. To extract only the hosts of the results, choose those two settings as well as 'only return hosts'.
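
How the three settings might combine can be sketched as follows. This is an illustration under stated assumptions rather than the Harvester's implementation: the function `harvest`, its parameter names and the `EXCLUDED_DOMAINS` list are all hypothetical, and the list covers only google.com and youtube.com and their subdomains.

```python
import re
from urllib.parse import urlsplit

# Assumed URL pattern and exclusion list; neither is taken from the Harvester.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")
EXCLUDED_DOMAINS = ("google.com", "youtube.com")


def harvest(text, only_uniques=True, exclude_google_youtube=True, only_hosts=False):
    """Extract URLs (or hosts) from text, mimicking the three settings."""
    results = []
    seen = set()
    for url in URL_PATTERN.findall(text):
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # skip strings that cannot be parsed as a URL
        if exclude_google_youtube and any(
            host == domain or host.endswith("." + domain)
            for domain in EXCLUDED_DOMAINS
        ):
            continue
        item = host if only_hosts else url
        if only_uniques and item in seen:
            continue  # drop repeats, keeping the first occurrence
        seen.add(item)
        results.append(item)
    return results


if __name__ == "__main__":
    page = (
        'href="http://www.google.com/search?q=x" '
        'href="http://example.org/a" href="http://example.org/a" '
        'href="https://www.youtube.com/watch?v=1" '
        'href="http://blog.example.net/post"'
    )
    print(harvest(page))
    print(harvest(page, only_hosts=True))
```

Filtering on the host rather than on the whole URL string keeps result URLs that merely mention Google or YouTube somewhere in their path or query.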

Sample project

Project: Extract URLs from the Daily Kos blogroll

  • Go to dailykos.com
  • View the page source (in Firefox, choose View > Page Source or press Ctrl+U)
  • In the page source, find the block of links under the blogroll
  • Copy that text and paste it into the Harvester, which outputs a list of URLs ready for further analysis, e.g. with the Issuecrawler
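
The last two steps above can also be sketched in Python for readers who want to script them. The example assumes the copied blogroll text is ordinary HTML with `<a href="...">` links; the class `LinkCollector`, the function `links_from_fragment` and the sample fragment are hypothetical and are not taken from dailykos.com or from the Harvester.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the anchor tags of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only absolute links; relative links point back into the site.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def links_from_fragment(html_fragment):
    """Return the absolute links found in a pasted piece of page source."""
    collector = LinkCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Made-up stand-in for the text copied from under the blogroll.
    blogroll = """
    <ul>
      <li><a href="http://blog-one.example.org/">Blog One</a></li>
      <li><a href="/about">About</a></li>
      <li><a href="https://blog-two.example.net/feed">Blog Two</a></li>
    </ul>
    """
    for link in links_from_fragment(blogroll):
        print(link)
```

Parsing the anchors instead of pattern-matching the raw text skips URLs that appear only in scripts or comments, which may or may not be what a given analysis needs.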

Topic revision: r7 - 05 Jan 2010 - 13:10:46 - Richard Rogers