Data retrieving and asynchronicity

posted by on 2019.03.09, under Processing

I am fascinated by data, its retrieval and its possible applications to art. We are constantly surrounded by data in textual, visual and auditory form, and it tells us something about all of us, as a society and as a species.
In this post I want to write about a simple scenario. Imagine we want to retrieve all the images appearing on a newspaper page and do something with them. For this simple case, I have chosen The New York Times. There are then two questions we want to answer. First: how do we get the urls of all the images present on the given page? And second: how do we download these images without disrupting the running animation? To answer these questions, we begin at the beginning and stop at the end, as the judge advises during the trial in Alice's Adventures in Wonderland. 😉
Data contained in a webpage is usually represented via a markup language: HTML, for instance, is such a language. In a markup language, the different structural pieces of a webpage are “tagged”: an item might have a “title” tag, for instance, which tells us that its content is a title of some sort. In the present case we will use XML, since The New York Times provides a .xml file (an RSS feed) for its various pages. In XML parlance, an XML file can be thought of as a collection of boxes called “children”, which can contain objects that have “content”, or other boxes that have children of their own, and so on. Now, each XML file is structured in a slightly different way, so one has to investigate case by case. For instance, you could run into problems applying the very same code that appears later to, say, The Guardian, since its XML file may be arranged differently.
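To make the picture of boxes inside boxes concrete, here is a minimal sketch in plain Java rather than Processing, so that it runs on its own. The document in SAMPLE is made up for illustration, and the class and method names (XmlBoxes, titles) are my own; it uses the XML parser that ships with Java, not Processing's XML class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XmlBoxes {
    // A made-up document: a "paper" box holding "section" boxes holding "title" boxes.
    static final String SAMPLE =
        "<paper>"
      + "<section><title>World</title><title>Arts</title></section>"
      + "<section><title>Science</title></section>"
      + "</paper>";

    // Returns the text content of every <title> that sits directly inside a <section>.
    static List<String> titles(String xmlText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList sections = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < sections.getLength(); i++) {
                Node section = sections.item(i);
                if (!(section instanceof Element) || !section.getNodeName().equals("section")) continue;
                NodeList children = section.getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child instanceof Element && child.getNodeName().equals("title")) {
                        result.add(child.getTextContent());
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titles(SAMPLE)); // prints [World, Arts, Science]
    }
}
```

The two nested loops mirror the structure of the file: one loop per level of boxes.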
Processing offers a class XML to deal with XML files and to search through their tags. Great! So, after spending some time investigating the RSS feed of the home page of The New York Times, we discover that the XML has a child called “channel”, which contains children tagged “item”, which in turn contain a child tagged “media:content”. Finally, this child has an attribute called “url”, which is what we are interested in. Pheeew! Once we have the list of urls, we can download the images with loadImage(), which also accepts urls.
Here the problem raised in the second question above appears, and we have to talk about “asynchronicity”. Both loadXML() and loadImage() are so-called “blocking functions”: until they complete their task, the code doesn’t move forward. This means that any animation would stutter. If we need to load the images only once, this is not a great problem: we do everything in the setup() function, and forget about it. For the sake of fun, though, I have decided that I would like to randomly add a new image from some other page while the animation goes on.
The way to circumvent the problem created by these blocking functions is to use a different “thread”. What does this mean? Processing lets us “thread” a function with thread(), which means that the function is executed in parallel with the main thread, which in our case is the so-called “animation” thread. By threading a function, we keep the main thread from being affected by any slowness of the threaded function. In our case, the function getData() loads another .xml file, grabs an image, and adds it to the list of images to display. The boolean locked keeps us from starting a new download while the previous one is still running.
We can now look at the code:
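The idea of keeping a blocking call off the animation thread can be sketched in plain Java as follows. Everything here is illustrative: slowLoad() only sleeps to imitate a slow download, the loop imitates draw() being called repeatedly, and the names BackgroundLoad, slowLoad and framesWhileLoading are my own.

```java
class BackgroundLoad {
    // Stands in for a blocking call such as loading a feed; the 200 ms delay is artificial.
    static String slowLoad() {
        pause(200);
        return "loaded";
    }

    // Sleeps for the given number of milliseconds.
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs slowLoad() on a worker thread and counts how many "frames"
    // the calling thread manages to run before the worker finishes.
    static int framesWhileLoading() {
        Thread worker = new Thread(() -> slowLoad());
        worker.start();
        int frames = 0;
        while (worker.isAlive()) {
            frames++;   // one pass of the loop plays the role of one call to draw()
            pause(10);  // roughly the time a frame would take
        }
        return frames;
    }

    public static void main(String[] args) {
        System.out.println("Frames drawn while loading: " + framesWhileLoading());
    }
}
```

Had slowLoad() been called directly inside the loop, no frame would have been counted until it returned; with the worker thread, the loop keeps running in the meantime.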

String[] urls ={ "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
  "http://rss.nytimes.com/services/xml/rss/nyt/Africa.xml", "http://rss.nytimes.com/services/xml/rss/nyt/ArtandDesign.xml",
  "http://rss.nytimes.com/services/xml/rss/nyt/Technology.xml", "http://rss.nytimes.com/services/xml/rss/nyt/Europe.xml"};

String url;

XML xml;
ArrayList<PImage> images;
int count;
PImage img;
volatile boolean locked = false; //Shared with the loading thread, hence volatile;

void setup() {
  size(1000, 1000);
  background(0);
  url = urls[int(random(0, urls.length))];
  images = new ArrayList<PImage>();

  xml = loadXML(url); //Loading the XML file;
  String[] names = {};

  XML[] children = xml.getChildren("channel"); //This is the first child of the XML file;

  for (int i = 0; i < children.length; i++) {
    XML[] items = children[i].getChildren("item");  //Images are contained in items;

    for (int j = 0; j < items.length; j++) {
      XML media = items[j].getChild("media:content"); //media:content is the tag that contains images;
      if (media != null) {
        names = append(names, media.getString("url")); //This reads the url, which is an attribute of the tag media:content;
      }
    }
  }

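  for (int i = 0; i < names.length; i++) {
    PImage loaded = loadImage(names[i]);
    if (loaded != null) { //loadImage() returns null when an image cannot be loaded;
      images.add(loaded);
      println("Loaded!");
    }
  }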
}

void draw() {
  if (images.size() == 0) return; //Nothing to draw if no image could be loaded;
  PImage im = images.get(count % images.size());

  tint(255, int(random(30, 100)));

  image(im, random(0, width), random(0, height), im.width * 0.3, im.height * 0.3);


  count++;
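  if ((random(0, 1) < 0.01) && !locked) {
    locked = true; //Lock before starting the thread, so that a second one cannot start in the meantime;
    thread("getData");
  }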
}

//Function to be threaded

void getData() {
  locked = true;
  try {
    url = urls[int(random(0, urls.length))]; //Choose a random url among those available;
    xml = loadXML(url);
    if (xml == null) return; //The feed could not be opened: give up for this round;
    String[] names = {};

    XML[] children = xml.getChildren("channel");

    for (int i = 0; i < children.length; i++) {
      XML[] items = children[i].getChildren("item");

      for (int j = 0; j < items.length; j++) {
        XML media = items[j].getChild("media:content");
        if (media != null) {
          names = append(names, media.getString("url"));
        }
      }
    }
    if (names.length > 0) { //A feed might contain no images at all;
      PImage loaded = loadImage(names[int(random(0, names.length))]);
      if (loaded != null) {
        images.add(loaded); //Add the new image to the main list;
      }
    }
  } 
  finally {
    locked = false; //Always release the lock, even if something went wrong;
  }
}

If you run the code, you should get something like

[Screenshot of the running sketch]
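A side note on sharing data between the two threads: a plain boolean flag and an ArrayList touched by both the animation thread and the loading thread can, in rare cases, misbehave. The following is a sketch of a sturdier alternative in plain Java, using AtomicBoolean and CopyOnWriteArrayList from java.util.concurrent. It is not the code of the sketch above: the class SafeLoader and its methods are hypothetical, and a String stands in for a PImage so that the example runs without Processing.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

class SafeLoader {
    // A list that can safely be read by one thread while another thread adds to it.
    final List<String> images = new CopyOnWriteArrayList<>();
    // True while a background load is running.
    final AtomicBoolean loading = new AtomicBoolean(false);

    // Starts a background load unless one is already running.
    // Returns the started thread, or null if the request was skipped.
    Thread requestLoad(String name) {
        // compareAndSet checks the flag and sets it in a single atomic step.
        if (!loading.compareAndSet(false, true)) return null;
        Thread worker = new Thread(() -> {
            try {
                images.add(name); // stands in for images.add(loadImage(...))
            } finally {
                loading.set(false); // release the flag even if loading failed
            }
        });
        worker.start();
        return worker;
    }

    // Waits for a worker thread to finish.
    static void await(Thread worker) {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        SafeLoader loader = new SafeLoader();
        await(loader.requestLoad("first.jpg"));
        System.out.println(loader.images); // prints [first.jpg]
    }
}
```

The point of compareAndSet is that checking the flag and raising it cannot be separated, so two requests arriving close together cannot both start a download.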

As an exercise, try to do something similar with a different website, so as to get comfortable with the process of understanding how a given XML file is organized.
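When investigating an unfamiliar feed, it can help to print an outline of its tags first. Here is a small sketch of such a helper in plain Java, under a few assumptions: the feed in SAMPLE is a tiny made-up example that merely imitates the channel, item and media:content nesting described above, the url in it is a placeholder, and the names XmlOutline, outline and walk are my own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;

class XmlOutline {
    // A made-up feed imitating the nesting channel > item > media:content.
    static final String SAMPLE =
        "<rss xmlns:media=\"http://search.yahoo.com/mrss/\"><channel><item>"
      + "<title>Hello</title>"
      + "<media:content url=\"http://example.com/a.jpg\"/>"
      + "</item></channel></rss>";

    // Returns one line per tag, indented by depth, with attribute names marked by @.
    static String outline(String xmlText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlText.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Appends the line for this element, then recurses into its child elements.
    static void walk(Element element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(element.getNodeName());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            out.append(" @").append(attributes.item(i).getNodeName());
        }
        out.append('\n');
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                walk((Element) children.item(i), depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(SAMPLE));
    }
}
```

Running it on the sample shows media:content nested under item with a url attribute, which is exactly the kind of information needed before writing the getChildren() calls for a new feed.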
