DMI 08 Wishlist

  • Done: URL cleaner. Returns hosts as in Analyse, but without deleting duplicates and keeping the same order as the input. Completed by Erik and added to the Analyse tool.
  • Nytimes.com archive scraper, working from http://nytimes.com/search (Erik, I think this will be useful in combination with your discovery tools. -- michael)
  • Option to set the colour scheme for a scheduled issuecrawl. (Currently every crawl gets a new colour scheme. It would be very useful if a scheduled crawl could use the same colours for .org, .com, etc. every time it crawls the same source set. This would make comparison over time a lot easier. -- sabine)
  • The circle map option in Issuecrawler is not working. The notice reads: "Circle Map - Due to persisting problems, the circle map has been disabled. The circle map returns in 2008."
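
The URL cleaner item above (hosts only, duplicates kept, input order kept) could look roughly like the following. This is a minimal sketch and not the code Erik added to the Analyse tool; the function name hosts_in_order is hypothetical, and it assumes that inputs without a scheme (such as en.wikipedia.org/wiki/...) should still yield a host.

```python
from urllib.parse import urlsplit


def hosts_in_order(urls):
    """Return the host of every input URL, keeping duplicates and input order.

    Hypothetical helper for illustration; not the Analyse tool's own code.
    """
    hosts = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank input lines
        # urlsplit only recognises the host when a scheme or a leading "//"
        # is present, so scheme-less input is prefixed with "//".
        if "://" not in url:
            url = "//" + url
        host = urlsplit(url).hostname
        if host:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    sample = [
        "http://www.nytimes.com/search",
        "en.wikipedia.org/wiki/Issue",
        "http://www.nytimes.com/archive",
    ]
    for host in hosts_in_order(sample):
        print(host)
```

Because nothing is deduplicated or sorted, the output has one host per usable input line, in the same order as the input.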

DMI Summer 07 Wishlist of tools

  • Technorati tag analysis, as brought up in SourceDistanceExerciseGroup1
  • Cross device/cross spherical tag cloud generator. This tool proposal combines the Analyze tool and the Tag Cloud Generator. Analyze can currently compare only 2 lists. To make a cross spherical tag cloud, the number of possible input lists should be expanded to at least 3 (up to 5). For visualization purposes, the output from Analyze ("Sites that are common to list1 and list2"; "Sites that only appear in list2"; "Sites that only appear in list1") can be combined with the Tag Cloud Generator by adding, in brackets, the number of lists in which a keyword or site appears. When comparing 2 lists, every site under "Sites that are common to list1 and list2" should get the addition (2), while sites under "Sites that only appear in list2" and "Sites that only appear in list1" should get (1). The results file would be an .svg that can be designed further in Illustrator. If possible, other design features could be added, such as a different color for each list and for overlapping results (list1=blue, list2=red, overlap=purple), and an Illustrator SVG filter that lays out the results automatically. Cross spherical source comparison/analysis was brought up in SourceDistanceExerciseGroup1.
  • common wikipedia memory analysis / meme browser, as brought up in http://www.justlol.net/devel/cvs/wikipediaNetwork
  • Image scraper that does not include headers & footers in the results
  • Tag cloud over time movie maker (1. take all results from archive.org 2. make tag cloud 3. view changes over time, i.e. see the "growth" and "decline" of words used, thus pointing to a shift in importance)
  • New Link Ripper/Harvester tool. When harvesting URLs from Google results, for instance, not all URLs are fetched, since some don't start with "http" or "www" (such as en.wikipedia... ). The list is therefore incomplete as input for other tools. Link Ripper fetches all URLs from a page, but that is also the problem: Google returns each result's URL twice, once in the title and once at the bottom of the result (in green), and in Google Blog Search the title URL and the green URL differ (specific page versus host). Cleaning up the URL list can be done partly automatically (using Analyze for the doubled Google results) or manually (Google Blog Search results), but it would be better if this were fully automated. Ripping only title URLs or only green URLs, ideally as an option, would be a useful addition to the DMI toolbox. Or, if possible (and probably best), harvest not only by "http" and "www" but also by a dot without surrounding spaces (text.text) or by ".com", ".nl", ".org", etc.
  • Tool proposal for scraping Hyves networks into Google Maps.
  • Anchor text scraper for Technorati and Yahoo! results.
  • MeScraper: a scraper that works like coffee but is healthier.
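
The bracketed counts proposed for the cross spherical tag cloud generator above can be sketched as follows. This is only an illustration under assumptions: the function name annotate_by_list_count is hypothetical, the count is taken to mean the number of input lists a site appears in (as in the (2) and (1) example), and the SVG output is left out.

```python
from collections import Counter


def annotate_by_list_count(*lists):
    """Label each site with the number of input lists it appears in.

    Hypothetical helper for illustration; accepts any number of lists,
    so the proposed 3 to 5 input lists work as well as 2.
    """
    counts = Counter()
    for items in lists:
        # dict.fromkeys drops repeats within one list but keeps their order,
        # so a site is counted at most once per list.
        for site in dict.fromkeys(items):
            counts[site] += 1
    # Counter keeps insertion order, so sites come out in order of first appearance.
    return [f"{site} ({n})" for site, n in counts.items()]


if __name__ == "__main__":
    list1 = ["a.org", "b.com", "a.org"]
    list2 = ["b.com", "c.nl"]
    list3 = ["b.com", "a.org"]
    for line in annotate_by_list_count(list1, list2, list3):
        print(line)
```

The labelled lines could then be fed to the Tag Cloud Generator, with the bracketed number used as the weight of each entry.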
Done:


Topic revision: r20 - 03 Oct 2008 - 08:57:46 - SabineNiederer