I use this blog as a soap box to preach (ahem... to talk :-) about subjects that interest me.

Monday, April 23, 2012

A real small-world network #2

It turns out that I hadn't counted the airports correctly: they are 698, not 699 as I stated in my previous article on this subject.

Furthermore, MMU (Morristown NJ) and SSI (Brunswick GA) are only connected with one another. Same story with MXY (McCarthy AK) and MYK (May Creek AK). Therefore, they are not really part of the network.

This leaves a network of 694 airports. Here is a table showing the minimum path lengths between all possible pairs:

Length    #
      1     3,692
      2    42,794
      3    74,998
      4    74,064
      5    39,197
      6     5,452
      7       268
      8         6
Total  240,471

Indeed, 240,471 corresponds to all possible airport pairings: 294 * 293 / 2.

As a result, the average path length is 3.498. Quite short, when considering that the network consists of 694 nodes.

Next, I have to calculate the clustering coefficient of the network as the fraction of pairs of nodes that, being directly connected to a third node, are also directly connected with each other.

Another way to see clustering is to consider triplets of nodes that are directly connected. If the nodes of a triplet are connected by two links, the triplet is said to be open. If all three nodes are directly connected to each other (i.e., with three links), the triplet is said to be closed. The clustering coefficient is calculated by dividing the total number of closed triplets by the total number of triplets.

We already know that the total number of open triplets in the network of US airports is 42,794, because that is the number of nodes connected via an intermediate node but not connected directly to each other. Those triplets are open because I programmed the database to rejected duplicate pairs. Therefore, pairs found to be connected via a path of length 2 couldn't be inserted into the database if they were already present with a path of length 1.

I wrote a program that, instead of attempting to insert pairs connected with a path of length 2 into the database, counts the cases in which the third node is connected to the first one. I avoided counting more the once the triplets by only considering triplets in which the nodes were in alphabetical order.

The program counted 17,440 closed triplets, resulting in a clustering factor of 0.29. This is higher than the clustering of the Web and of the Internet, as shown in a presentation slides dated May 2008 (www.newton.ac.uk/programmes/CSM/seminars/050114001.pdf):

Network Vertices  γ Path Cluster
WWW 2 × 108 2.1 16 0.1078
Internet, router 1500002.4 11 0.18
Movie actors 212250 2.3  4.54 0.79
Maths, coauthors 70975 2.5  9.5 0.59
Phone calls 53 × 106 2.1

Synonyms 22311 2.8 4.5 0.7

Original sources: Adamic (1999), Aiello et al (2000), Barabási, Albert (1999),Broder (2000), Watts, Strogatz (1998).

The "γ" indicates how long the right tail of the power-law distribution of cluster sizes is. The lower the γ, the slower the number of clusters decreases with the cluster size. The expression for the power law is as follows: N(size) = K * size, where K is a constant. A γ of 1 means that the number of clusters halves when their size doubles. A γ of 2 means that the number of clusters is reduced to a quarter when their size doubles. And so on.

To calculate the distribution for "my" network, I needed to write another small program that listed the number of links of each airport (i.e., node) or, to say it in a different way, the cluster sizes.

It turned out that there are 128 airports with a single connection to another airport. In case you are curious, the best connected city is Atlanta, with non-stop flights to 158 destinations, followed by Minneapolis/Saint Paul with 146, Chicago with 142, and Denver with 141.

The following plot shows log(N) vs. average log(size). This means that γ = 1.25, and K = 264. K should match 128. Why is it much larger? I haven't got a clue. The discrepancy sounds dramatic, but when considering the logarithms, as it appears on the plot, it doesn't look catastrophic.

And why is the γ half of what reported in the examples? Again, I have no idea.

In any case, now that I have found a bona-fide small-world network, I think for a while I will concentrate on something else.

Saturday, April 21, 2012

Soldiers and body parts. What's new?

The publication of photos that show US soldiers posing with dismembered suicide bombers have drawn almost universal condemnation.

I confess I don't understand what the fuss is all about. Perhaps we have become accustomed to neat pictures of laser-guided missiles hitting their targets with uncanny precision, and have forgotten that war on the ground is not like that.

Although I know that sometimes wars are unavoidable and even appropriate, I am at heart a pacifist and a non-violent. To the point that I avoid swatting flies. That said, the images of soldiers pissing on dead enemies or posing with dead bodies, dismembered or not, don't surprise me in the least.

I would never do it, but then, I am not a soldier fighting a guerrilla war. To kill other human beings is not part of human nature. For the vast majority of people, killing somebody is a shock. Even when the killing is done in self defence. That's why combat training always involves creating automatisms and dehumanising the enemy. The more a soldier sees an enemy as a target or a threat rather than as a fellow human being, the more he is likely to shoot first and survive.

And then, there is the suppressed fear of dying and suffering, which bottles up on every mission and sortie. It is inevitable that under such circumstances, some soldiers go further than what we, watching a sanitised war on TV from our comfortable sofa, find acceptable.

The massacres of My Lai and Srebrenica are excesses that happened because the soldiers didn't consider their victims as other human beings. Same story with the Nazis and the Holocaust: it was OK to kill the Jews because, according to Hitler & Co., they were subhuman.

But to come back to the recent pictures of soldiers and body parts, they are not even something new. I found in my personal library a photographic book in Italian about the Vietnam War. The title, translated into English, is "Vietnam – Against a Genocide" (Vietnam – contro un genocidio), edited by Livia Rokach and published in 1972 by Napoleone Editore. Here is the picture on the front cover:

I know: it is very upsetting. But the point I want to make is that wars are always ugly.

Politicians want to have the cake and eat it too. Fortunately, the soldier who made the photos public, reminded us that war is not nice. The only clean war is a war that doesn't take place at all!

There is also another picture I would like to show you. This one shows a torture that we now call "waterboarding". In Vietnam, they didn't use a board, but the technique was the same. The soldier with a helmet is a colonel.

Wednesday, April 18, 2012

A real small-world network

I am fascinated by small-world networks and, for once, I would like to analyse a real network of which I know the complete structure.  This would make possible for me to calculate its average path length and its clustering.

After thinking about it for a while, given the fact that I am interested in civil aircrafts, I though it would be nice to study the global air-route network.

Years ago, I had a list of all air sectors in existence.  It was in one of those telephone-book-sized manuals marked "ABC" and used by travel agents.  But I must have thrown it away before one of my many moves, and I suspect that nowadays no travel agent has even ever seen such a book.

I couldn't find anything useful on the IATA (International Air Travel Association) and ICAO (International Civil Aviation Organization) websites.

When I searched Google for "nonstop flights XXX", with XXX set to a IATA airport code (e.g., JFK for New York's Kennedy) it showed a list of all cities with non-stop flights to that airport.

"That's what I need!" I thought.  "I only need to download from somewhere a list of all airport codes and write a little script that interrogates Google for all of them.  Then, I can convert the city names to IATA codes to have a list of all airport pairs.  I'll have to merge the lists of airports located near the same city (e.g., FCO for Rome Fiumicino and CIA for Rome Ciampino), but I can do that.  Once I have all city pairs, I can build my network.

It sounded reasonable, but there was a catch: Google makes it very difficult to be able to execute a query programmatically.  So much so that I gave up.

I had to find another network.

Fortunately, I discovered the site http://nsflight.com.  It lists all pairs of US airports that are connected non-stop.  As it turned out, to limit the study to the US was a blessing in disguise, because the analysis of the air routes requires an immense amount of computing time.

On April 13th (a black Friday), I finally had my list of US cities with an airport (699) and a list of directly-connected city pairs.  That's when I discovered that it would take days and days to get a result.

What I needed to build my network was a program to merge non-stop pairs into one-stop pairs, then to two-stop pairs, and so on until all 699 nodes had been connected to each other.  That is an enormous number: 699 x 698 / 2 = 243,951.  Certainly, too big to be kept in memory.  I had to build a database with the pairs.  Here is the SQL script I used:
create database pairs;
create table pairs.pairs (
  one character(3),
  two character(3),
  len integer
create index one_key on pairs.pairs (one, len);
create index two_key on pairs.pairs (two, len);
create index len_key on pairs.pairs (len);
create unique index pair_key on pairs.pairs (one, two);

As you can see, each record consists of a pair of three-letter city codes and a number, which represents the minimum number of intermediate stops (i.e., the minimum path length between the pair of nodes).

The fourth index associates to the pair of city codes a unique key.  This guarantees that each pair can only appear once in the database.  I decided that the nodes of each pair would always appear in alphabetical order.  This avoided useless duplications that I would have had to remove later.

After writing an SQL script to insert the 1-pairs (i.e., the pairs with a single link between them), I wrote a JSP script to discover all possible combinations of pairs "back to back" and insert them as 2-pairs.

The JSP script went through the hard-coded list of city-codes one by one and queried the database to obtain all pairs that had the current city-code in either position and the len column set to 1.  This returns a list of all pairs containing the current city-code.  Then, the scripts repeats the query for each one of the other city-code of each pair, which returns another list of pairs.

For example, "GAM" (Gambell AK) is directly connected with five other cities (Elim, Nome, Shaktoolik, Savoonga, and White Mountain, all in Alaska).  The first query in the script returns this list:

The script executes the second database query for all five.  For example, the query for Shaktoolik returns the following list:

The script then combines the 1-pairs obtained with the second query with the original 1-pair GAM-SKK to form six tentative 2-pairs:

While going through the 2-pairs, the script discards GAM-SKK-GAM because the two ends are identical, but tries to insert into the database the other five 2-pairs with len set to 2. The first one fails because GAM-ELI is already present with len = 1, but the other four succeed.

Once obtained all 2-pairs, I launched the script again but with the len parameter of the second query set to 2.  This gave me all 3-pairs, but I had to go through all cities in three chunks, because the computer ran out of memory.  Two factors contributed to this debacle: firstly, I had to launch the script from a web browser, which gobbled up quite a bit of memory; secondly, the script was executed within the Tomcat Java server, which also used up plenty of memory and was known in the past to have memory leaks.  Even without leaks, it would have probably failed.

At this point, I converted the JSP script into a Java application, and the Windows Task Manager told me that the usage of physical memory remained around 40%.

At the moment of writing, I am running the application with 1-pairs and 4-pairs to obtain 5-pairs.  It has been running for hours, and I expect it to run overnight.  The CPU usage remains well below 15%.  This means that the time goes into database access.  In fact, I can hear that the hard disk is working hard. I don't see how I could speed that up.  Perhaps I could create a virtual disk in memory and use that to store the database.  I am open to suggestions...

The total number of 1-, 2-, 3-, and 4-pairs is 195,550.  The program will still have to run for a while longer to reach the longest of the shortest paths between pairs of nodes.

Once I will have the list of all pairs, I will be able to calculate their average length.  I'll keep you posted.

Thursday, April 12, 2012

Small-world networks (or not?)

I published three articles on small world networks.  The first one was in August 2010: "Network of Feedbacks on eBay".  The second and third articles were in March 2012: "The small world of this blog" and "More on visitors to this blog".

In those occasions, I stated that the feedbacks on eBay and the visitors to this blog formed small-world networks because the number of links attached to was distributed across the nodes according to a power law.

But a couple of days ago, in those moments of semi-lucidity that precede sleep, I realised that the presence of a power-law distribution does not prove the presence of a small-world network.

According to Steven Strogatz, a network is called "small-world" when the average number of links between nodes is low and when clustering is high.  The distribution of the number of links attached to each node doesn't always follow a power law.  Therefore, a power-law distribution is not a necessary condition for a network to represent a small-world.

But the crucial point is that I haven't found anywhere a statement claiming that a power-law distribution is sufficient to identify small-world networks.  In fact, power-laws pop up in many places.  For example, the spectrum emitted by supernova remnants and other non-thermal sources of radiation follows power laws.

The bottom line is that the power-law distributions I discovered in the eBay feedbacks and in the blog visitors, contrary to what I stated in my previous articles, do not automatically imply that we are looking at small-world networks.

Wikipedia states that "a small-world network is defined to be a network where the typical distance L between two randomly chosen nodes (the number of steps required) grows proportionally to the logarithm of the number of nodes N in the network."

Scholarpedia, more generally, states that "A small-world network refers to an ensemble of networks in which the mean geodesic (i.e., shortest-path) distance between nodes increases sufficiently slowly as a function of the number of nodes in the network."

It seems that a universally recognised definition of "small-world network" is not available. But I don't like the definitions given by Wikipedia and Scholarpedia, because they depend on being able to "grow" the network or, at the very least, to have several instances of the network you are studying.  But what do you do if you only have one network to work with?  You cannot simply choose subsets of nodes, because the resulting networks might not be meaningful.

I like the definition used by Strogatz and Watts: a small-world network is characterised by having low "average path length" and high "clustering".  You can apply such a definition to any individual network.  To calculate the average path length, take any pair of nodes and count the number of links that form the shortest chain between them; then, repeat it for all pairs of nodes and average the result.  Clustering measures the level of overlapping in the network and is calculated as the probability that two nodes linked to a common node are also linked with each other.  Or, speaking of frequencies instead of probabilities, clustering measures the fraction of pairs of nodes that, being directly connected to a third node, are also directly connected with each other.

Low "average path length" can be identified in terms of network size.  The most well known example is the "six degrees of separation" with a world population (i.e., nodes) of six billions.  This was sort-of verified in 1967 by Stanley Milgram. He asked 160 randomly-chosen residents of Wichita, Kansas, and Omaha, Nebraska, to send letters to a particular person or, if they didn't know the person, to somebody who might know him/her.  As recipients, he chose the wife of a divinity graduate student in Sharon, Massachusetts, and a stockbroker in Boston. Of the 160 letters, 42 reached their destination, after an average of 5.5 intermediaries.  Now, 6.5 links is clearly very low when compared to the number of nodes in the network.  In case you are interested, the Web has 19 degrees of separation and the Internet 10.

Understanding what "high clustering" means is a bit trickier.  A loop in which all nodes are connected to two neighbours as 0 clustering, because no two nodes connected to the same node are ever connected together.  At the other end of the spectrum, a network in which all nodes are connected with each other has a clustering of 1.  In a fully-connected network, the average path length is exactly 1 regardless of the network size.  Therefore, such a network is a small world, although not particularly interesting.

In general, in loosely connected networks, the presence of "hubs" that connect many nodes is sufficient to reduce the average path length and increase the clustering.  This is what happens in social media: some people with many connections in FaceBook or LinkedIn connect through them pairs of people who would normally be "very far" from each other.

To go back to what motivated me to write this article, the power-law distributions I showed in my previous articles might as well to do with small-world networks, but to prove it is for all practical purposes impossible.

I have to think about it...

Tuesday, April 10, 2012

Linear and nonlinear systems

I am reading a very interesting book: "Sync – the emerging science of spontaneous order", by Steven Strogatz, published by Penguin Books in 2004.

On page 181 of the book, I found the best description of linearity (and nonlinearity) I have ever found. I thought I might share it with you.
The mathematician Stanislaw Ulam once said that calling a problem non-linear was like going to the zoo and talking about all the interesting non-elephant animals you see there. His point was that most animals are not elephants, and most equations are not linear. Linear equations describe simple, idealized situations where causes are proportional to effects, and forces are proportional to responses. If you bend a steel girder by two millimeters instead of one, it will push back twice as hard. The word linear refers to this proportionality: If you graph the deflection of the girder versus the force applied, the relationship falls on a straight line. (Here, linear does not mean sequential, as in “linear thinking,” plodding along, one thing after another. That’s a different use of the same word.)

Linear equations are tractable because they are modular: They can be broken into pieces. Each piece can be analyzed separately and solved, and finally all the separate answers can be recombined -literally added back together- to give the right answer to the original problem. In a linear system, the whole is exactly equal to the sum of the parts.

But linearity is often an approximation to a more complicated reality. Most systems behave linearly only when they are close to equilibrium, and only when we don’t push them too hard. A civil engineer can predict how a skyscraper will sway in the wind, as long as the wind is not too strong. Electrical circuits are completely predictable -until they get fried by a power surge. When a system goes nonlinear, driven out of its normal operating range, all bets are off. The old equations no longer apply.

Still, you shouldn’t get the idea that nonlinearity is dangerous or even undesirable. In fact, life depends on nonlinearity. In any situation where the whole is not equal to the sum of the parts, where things are cooperating or competing, not just adding up their separate contributions, you can be sure that nonlinearity is present. Biology uses it everywhere. Our nervous system is built from nonlinear components. The laws of ecology (to the extent we know them) are nonlinear. Combination therapy for AIDS patients -drug cocktails- are effective precisely because the immune response and the viral population dynamics are both nonlinear; the three drugs in combination are much more potent than the sum of the three of them taken separately. And human psychology is absolutely nonlinear. If you listen to your two favorite songs at the same time, you won’t get double the pleasure.

This synergistic character of nonlinear systems is precisely what makes them so difficult to analyze. They can’t be taken apart. The whole system has to be examined all at once, as a coherent entity.

Friday, April 6, 2012

Buying a Chinese book

As the translators didn't reply (see my previous article), I decided to seek help elsewhere. I sent a message to one of the Yahoo groups I belong to: the Canberra Speculative Fiction Guild (also see the CSFG website csfg.wordpress.com/).

The CSFG Yahoo group consists of more than 200 people interested in Science Fiction, Fantasy, and/or Horror. It is a bunch of enthusiastic and active people and, although the group is based in Canberra, Australia, it attracts members from everywhere.

I thought: the book I'm trying to buy in Chinese has nothing to do with Speculative Fiction, but some of the CSFGers might be able to help. They are always very available.

And it worked! Robin Shortt sent me all the information I needed. Robin, if you are reading this, thanks again.

And here is how you can buy a book in Chinese without understanding a word of it:

  1. Use the web browser Chrome, which you can download from google.com/chrome. The reason for this is that it can automatically translate web pages from Chinese (simplified Han) to English. It doesn't translate the text inside buttons, but sometimes you can see the corresponding URL by placing the cursor over them.
  2. Go to amazon.cn. It offered my book cheaper than the other five online bookshops I could find. This was also a tip from Robin.
  3. Create a user ID. You will not be able to use your "Western" user ID that is valid with amazon UK, US, DE, FR, etc.
  4. When you order a book, to be able to enter your address outside China, you will have to click on the link that is above the form to enter a Chinese address. It looks like this: 海外地址
  5. Pay with a credit card.

I paid 41.30 for the book and 110.00 for shipping.  More or less AUD 23.30.

Thursday, April 5, 2012

My Book in Chinese

For a couple of years my attempts to obtain a copy of the Chinese version of my book "Beginning JSP,JSF, and Tomcat Web Development"  has been completely unsuccessful.

A few days ago, I decided I had to find a way. It took me some inventive "googling" to find the name of the publisher, but I finally found it: product.china-pub.com/195967.

I discovered that the Chinese edition only became available in September 2009, almost two years after the original.

Here is an image of the front cover:

It costs AUD 9.00, much less than the original. I expected that.

Thanks to Google Translate, I was able to send a message to the publisher via the contact page on their website. I quickly received a reply (in Chinese). They told me that I have to buy it through their website and that they only ship domestically.

I sent another email. I cannot give up so quickly, can I?

The book was translated by Shi Xiaohui and Yuan Yong Kay and I believe I have found them. I send them emails and asked them whether they can halp me in getting a copy of the book. But, to be on the safe side, I also send a message to three more people named Shi Xiaohui, two of whom are on FaceBook.

I know: it's desperate...

I just discovered that I can buy the book from Biblio. But it would cost me AUD84. MMmmm... Quite a markup from the AUD9 dollars the Chinese pay domestically...