DMI Tool Wish List

  • Compare Lists and Triangulate currently treat http://govcom.org, http://govcom.org/, http://www.govcom.org and http://www.govcom.org/ as 4 different URLs. The suggestion is to ignore http://, www. and the trailing / in the comparison.
  • A scraper for http://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Approved that outputs a table with name, page, tasks, contribs, actions log, block log, flag log and user rights.
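The comparison suggested for Compare Lists and Triangulate could look roughly like the sketch below. It is only an illustration, not the tools' actual code: the function names comparison_key and unique_urls are made up for this example, and treating https:// like http:// and lowercasing the host are assumptions that go beyond the original suggestion.

```python
def comparison_key(url):
    """Return a key that ignores the scheme, a leading 'www.' and trailing slashes."""
    key = url.strip()
    # Assumption: https:// is ignored in the same way as http://.
    for scheme in ("http://", "https://"):
        if key.lower().startswith(scheme):
            key = key[len(scheme):]
            break
    if key.lower().startswith("www."):
        key = key[len("www."):]
    key = key.rstrip("/")
    # Assumption: host names are case-insensitive, paths are left untouched.
    host, sep, rest = key.partition("/")
    return host.lower() + sep + rest


def unique_urls(urls):
    """Keep the first spelling seen for each comparison key, in input order."""
    seen = {}
    for url in urls:
        seen.setdefault(comparison_key(url), url)
    return list(seen.values())
```

With this key the four govcom.org spellings from the first item all compare as equal, while the original spelling is kept for display.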

  • The Dmi.Tool Wikipedia Bot Edits Scraper seems to be broken. For the Wikipedia page on Barack Obama, http://en.wikipedia.org/wiki/Barack_obama, it gives only 1 result, while a look at the page history shows more bot and tool activity.
  • The treemap generator output displays blocks according to size, all in burgundy red. Make an option for other colours and an option for creating a tree map/heat map.
  • The launch button of treemap generator and raw text 2 treemap reads 'clouds to svg/pdf' but only produces svg (no pdf). Also, the tool does not recognize 'polar bear (10)' as one term: it sizes 'bear' and leaves out 'polar'.
  • New Technorati Scraper. Old ones are broken:
  • overall log in system
  • faceted search for open calais
  • split 'bla (2)' into 'bla bla', and the reverse. Per line or just as a bag of words
  • count lines in common in triangulator
  • Wikipedia Bots, just show edits, make bots optional (checkbox)
  • install local version of wikipedia (useful for synonyms, wikipedia networks, etc)
  • scaleWorld interface (reminder: switch tag)
  • google scholar scraper
  • yandex and baidu scrapers
  • issuedramaturg volatility
  • significance measures on engines: e.g. chi-square
  • inversed tag cloud
  • screenshot generator
  • Labels in issuegeographer do not display anymore
  • Issuefeed: issues through time
  • Option to set the colour scheme for a scheduled issuecrawl. (Now every crawl gets a new colour scheme. It would be very useful if a scheduled crawl could use the same colours for .org, .com, etc. every time it crawls the same source set. This would make comparison over time a lot easier. -- sabine)
  • the circle map option in issuecrawler is not working: Circle Map - Due to persisting problems, the circle map has been disabled. The circle map returns in 2008.
  • Webpage History Generator - Uses the Internet Archive's Wayback Machine to make screenshots of all different versions of a site and output a webpage history scroll.
  • technorati tag analysis, as brought up in Source Distance Exercise Group 1
  • cross device/cross spherical tag cloud generator. This tool proposal is a combination of the Analyze tool and the Tag Cloud generator. In Analyze it is now only possible to compare 2 lists. To make a cross spherical tag cloud the number of possible input lists should be expanded to at least 3 (up to 5). For visualization purposes the output from Analyze ("Sites that are common to list1 and list2"; "Sites that only appear in list2"; "Sites that only appear in list1") can be combined with the Tag Cloud Generator by adding the frequency of appearance of a keyword or site between brackets. When comparing 2 lists, every site under "Sites that are common to list1 and list2" should get the addition (2), while sites under "Sites that only appear in list2" and "Sites that only appear in list1" should get (1). The results file would be a .svg that can be further designed in Illustrator. If possible, other design features could be added, such as a different color for each list and for overlapping results (list1=blue, list2=red, overlap=purple), and an Illustrator svg filter that organizes the layout of the results automatically. Cross spherical source comparison / analysis as brought up in SourceDistanceExerciseGroup1
  • common wikipedia memory analysis / meme browser, as brought up in http://www.justlol.net/devel/cvs/wikipediaNetwork
  • image scraper that does not include headers & footers in the results
  • tagcloud over time movie maker (1. take all results from archive.org 2. make tag cloud 3. view changes over time (see "growth" and "decline" of words used, thus pointing to a shift in importance))
  • Tool proposal for scraping Hyves networks in Google Maps.
  • anchor text scraper for Technorati and Yahoo! results.
  • Nytimes.com Archive scraper. from http://nytimes.com/search (Erik, I think this will be useful in combination with your discovery tools - michael)
  • MeScraper. a scraper that works like coffee but is more healthy.
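Several items above rely on the same "term (count)" line format: the 'polar bear (10)' problem in the treemap generator, the 'bla (2)' split/reverse wish, and the (1)/(2) annotations in the cross spherical tag cloud proposal. The sketch below shows one possible way to handle that format. All function names are hypothetical, and treating everything before the final "(n)" as a single term is an assumption about the intended behaviour.

```python
import re
from collections import Counter

# Everything before the final "(n)" is treated as one term, spaces included.
TERM_COUNT = re.compile(r"^(?P<term>.+?)\s*\((?P<count>\d+)\)\s*$")


def parse_line(line):
    """Return (term, count); a line without "(n)" counts once."""
    match = TERM_COUNT.match(line.strip())
    if match:
        return match.group("term"), int(match.group("count"))
    return line.strip(), 1


def expand(lines):
    """Turn "bla (2)" into ["bla", "bla"], one item per occurrence."""
    items = []
    for line in lines:
        if line.strip():
            term, count = parse_line(line)
            items.extend([term] * count)
    return items


def collapse(items):
    """The reverse: repeated items become "term (n)" lines, in first-seen order."""
    counts = Counter(item.strip() for item in items if item.strip())
    return [f"{term} ({count})" for term, count in counts.items()]


def annotate_overlap(*lists):
    """Tag every item with the number of input lists it appears in."""
    counts = Counter()
    for entries in lists:
        # A set, so an item repeated within one list is counted once for that list.
        counts.update({entry.strip() for entry in entries if entry.strip()})
    return [f"{item} ({n})" for item, n in sorted(counts.items())]
```

annotate_overlap accepts any number of lists, so the 3 to 5 input lists asked for in the proposal need no special handling; its output is already in the bracketed format the Tag Cloud Generator is described as using.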
Done:
  • NetworkCloud asks for a network_id. For better consistency the prompt should instead be "Insert an Issuecrawler XML file".
  • Yahoo inlink Scraper - Gets all the inlinks to a site from Yahoo (howto).
  • De.licio.us Related Tags Cloud Generator - Create a tag cloud showing URLs and tags related to a specific issue or keyword.
  • Delicious tags for url - Get delicious tags for a specific URL (tagcloudable).
  • New Link Ripper/Harvester tool. When harvesting URLs from Google results, for instance, not all URLs are fetched, since some don't start with "http" or "www" (such as en.wikipedia... ). The list is therefore incomplete as input for other tools. Link Ripper fetches all URLs from a page, but that is at the same time the problem. In Google, each result's URL is returned twice, once in the title and once at the bottom of the result (green), and in Google Blog Search the title URL and the green URL are different (specific page and host). Cleaning up the URL list can be done partly automatically (using Analyze for the doubled Google results) or manually (Google Blog Search results), but it would be better if this was fully automated. Ripping only the title URLs or only the green URLs, preferably as an option, would be a useful addition to the DMI toolbox. Or, if possible (and probably best), harvesting not only by "http" and "www" but also by a "." without spaces (text.text) or by ".com", ".nl", ".org" etc. -> see Tool Harvester, Tool Triangulation, and Tool Compare Lists
  • URL cleaner - return hosts like in Analyse, but without deleting duplicates (and still keeping the same order as input). Completed by Erik, added to analyse tool
  • linkrip -> screenshot -> flv slideshow (e.g. to see a site's evolution in time through the wayback machine)
  • tag cloud to svg, as brought up in Issue Image Analysis
  • Make the search field in the Technorati scraper the same as in the Google News scraper
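The harvesting idea in the Link Ripper/Harvester item (matching "text.text" as well as "http" and "www") and the URL cleaner item (hosts in input order, duplicates kept) can be sketched as follows. This is not the code of Harvester or Analyse: the pattern URL_LIKE and both function names are hypothetical, and the pattern is a deliberately loose heuristic that will also pick up file names such as results.svg and trailing punctuation inside paths.

```python
import re

# Assumption: a URL-like token is an optional scheme, dot-separated labels
# ending in a label of two or more letters, and an optional path.
URL_LIKE = re.compile(
    r"(?:https?://)?"               # optional scheme
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"   # host such as en.wikipedia.org
    r"(?:/[^\s<>\"']*)?",           # optional path up to whitespace or a quote
    re.IGNORECASE,
)


def harvest(text):
    """Return every URL-like token in the text, in order of appearance."""
    return URL_LIKE.findall(text)


def hosts_in_order(urls):
    """Reduce each URL to its host, keeping duplicates and the input order."""
    hosts = []
    for url in urls:
        rest = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
        hosts.append(rest.split("/", 1)[0].lower())
    return hosts
```

Because the pattern has no capturing groups, findall returns whole matches. Removing the doubled Google results would still be a separate step after harvesting.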


Topic revision: r32 - 25 Jun 2009 - 12:30:01 - Esther Weltevrede