Re-Engagement

This blog was meant as a place to share ideas, musings and discussions on topics of interest like Machine Learning, Data Science, Computational Social Science and some side projects. The idea was to have at least one post every quarter. However, my father passed away in October 2013, which led to a period of withdrawal and introspection for me. After almost 3 years I am again in a position to re-engage with the community and also share what I have been up to during this time. Hopefully I will be able to post on a regular basis from now on, i.e., at least once a quarter, and share some thoughts on the aforementioned subjects. Here is a list of things that you can expect in the next 6 months or so.

  • Analysis of Sexual Harassment data.
  • Distribution of Muslim population in NYC.
  • New results from the Hadith Networks project.
  • Primer on Statistical Distributions in Nature.
  • Update on the personality emulation project.

In the meantime I will also be writing on 3 Quarks Daily as a columnist on a regular basis. If there is any subject that readers would like me to cover, or any citizen data science project that you would want me to work on, then be sure to leave it in the comments below.

Do Algorithms have Politics?

(Heatmap of users tweeting the N word in the US, from the Geography of Hate project at Humboldt State University)

In Do Artifacts Have Politics?, Langdon Winner argues that certain technologies are democratic or autocratic regardless of the intent of their creators. The most well-known example that Winner uses to illustrate his point is a set of overpasses built on Long Island in the 1930s. These overpasses were purposely built so low that public buses could not pass under them. This had the desired effect that public transportation could not reach certain beaches, which were mostly used by rich white folks. The road infrastructure itself excluded people of color and working-class white people. Even though there are now many laws against racial discrimination, it would take millions of dollars to change the overpasses, so the road infrastructure entrenches an aspect of racism and classism.

With the recent racist Google Maps hack, the question of racial bias in Big Data and racism in algorithms has come to the fore again. However, the Google Maps example is closer to consciously rigging the data so that it produces certain results, which is more akin to Google bombing. Consider instead the case of Latanya Sweeney, which triggered the debate regarding racial bias in algorithms. Dr. Sweeney is African American, a Professor at Harvard University and the director of the Data Privacy Lab at Harvard. Dr. Sweeney observed that searches for traditional African American names on Google returned ads suggesting a look at the person’s criminal record, but that this was not the case for traditional Caucasian names. This created some ruckus in the media, and even Google stepped in and apologized after stating that there was nothing intentional on its part. While Google has fixed this problem, the same cannot be said for other search engines like Yahoo! Here is a screen grab of the search results for Latanya Sweeney with the ad for criminal records highlighted.

[Screenshot: search results for Latanya Sweeney, with the criminal-records ad highlighted]

Dr. Sweeney is not the only person in her department at Harvard whose name triggers this ad, but it is triggered only by her name and those of the other African American faculty members in the department. Here is another example.

[Screenshot: search results for another faculty member, showing the same ad]

If one searches for the other faculty members in her department, one does not get such links in the ads. Much has been written about this issue in the past, and it was supposedly fixed. It seems that Google is not the only one that needs to fix this issue in its own backyard. People of color face this issue on a regular basis.

Another simple but telling example is that of searching for traits of people. The auto-suggest function in Google reveals what similar search terms people have searched for in the past. The following two examples, for a religious and a racial group, speak for themselves.

[Screenshots: Google auto-suggest results for a religious group and for a racial group]

Even the Daily Show highlighted the issue last year with their segment on racism. What is going on here? These searches reveal more about the population of Google users than about the groups being searched for. More often than not it is by accidental auditing that one discovers these flaws in technological systems. People who argue against any sort of tweaking of the algorithms argue that the algorithms are a mirror of reality. What this aphorism fails to acknowledge is that it is not physical reality but rather social reality that we are mirroring. Social reality is, by its very definition, inherently flawed and biased towards one group or the other.

Adding “objectivity” through an algorithm or systematic analysis can still add bias, not because the analysis or the algorithm used to analyze the data is itself biased, but because the systems that generate the data (e.g., law enforcement departments or the judicial system) are biased. Consider the scenario where African Americans are more likely to be incarcerated for a particular offense while other people are less likely even to be charged. Over time the data will show a higher crime rate and incarceration rate for African Americans, even though it is the bias in the system itself that leads to this state of affairs. Any algorithm or other type of analysis will reproduce this observation. The bias will remain unless we add explicit conditions that check against the rates of incarceration of other groups for the same crime, or segment the data.
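Here is a minimal sketch of that effect, assuming two hypothetical groups that commit an offense at exactly the same rate but are charged with different probabilities. All names and numbers are made-up assumptions for illustration, not real statistics.

```python
# Hypothetical illustration: identical behavior, unequal enforcement.
# Every rate below is an assumption chosen for the example.

def recorded_rate(offense_rate, charge_probability):
    """Share of a group that ends up in the records for an offense."""
    return offense_rate * charge_probability

# Both groups commit the offense at the same (assumed) rate.
OFFENSE_RATE = 0.05

# The system charges one group far more often than the other (assumed).
charge_probability = {"group_a": 0.60, "group_b": 0.20}

recorded = {
    group: recorded_rate(OFFENSE_RATE, p)
    for group, p in charge_probability.items()
}

# What a naive analysis of the records alone would conclude.
naive_ratio = recorded["group_a"] / recorded["group_b"]

# Conditioning on the charging rate recovers the equal underlying rate.
corrected = {
    group: recorded[group] / charge_probability[group]
    for group in recorded
}

print(f"recorded rates:  {recorded}")
print(f"naive ratio a/b: {naive_ratio:.1f}")
print(f"corrected rates: {corrected}")
```

In this toy setup the records make group_a look about three times as likely to offend, purely because of the charging step. The correction only works here because the charging probabilities are known by construction; in real data they would have to be estimated.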

Such biases can also carry over to other domains like recommendation systems. Consider the infamous case of admissions decisions at St. George’s Hospital Medical School. On the surface, the idea of a system that uses past admissions data to make decisions about the future makes sense, as it would not have the same biases as human decision makers. However, using past decisions, which could have been made by biased people, does not reduce the bias. Rather, it perpetuates it: if minorities were left out in the past because of some systematic bias, then even the “unbiased, objective” algorithm will make the same biased decisions.

One way to reduce the bias and tackle this issue is algorithmic auditing. Consider the following illustrative example. Based on historical transaction data, an advertising campaign targets 1,000 users. All of them happen to be white, even though the algorithm uses only the history of clicks, usage and purchase patterns to determine which users should be targeted. A question arises: is the algorithm being racist? At a fundamental level, of course not, because there is nothing explicit in the algorithm that states that it should target or not target a particular set of people. It is the bias in the system (judicial, educational, governmental, etc.) that leads to the production of the data which is then fed into the algorithm.
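A very simple audit of such a campaign could compare how often each group is selected. The sketch below is only one possible check, under stated assumptions: the group labels, the population sizes and the 0.8 threshold (borrowed from the "four-fifths" rule of thumb used in US employment guidance) are all illustrative choices, and the function names are hypothetical.

```python
from collections import Counter

def selection_rates(population, targeted):
    """Fraction of each group in `population` that was targeted.

    Both arguments are lists of (user_id, group) pairs.
    """
    pop_counts = Counter(group for _, group in population)
    hit_counts = Counter(group for _, group in targeted)
    return {g: hit_counts[g] / n for g, n in pop_counts.items()}

def audit(population, targeted, threshold=0.8):
    """Flag groups whose selection rate is below `threshold` times the
    highest group's rate. The threshold is an assumed rule of thumb."""
    rates = selection_rates(population, targeted)
    best = max(rates.values())
    if best == 0:
        return rates, []
    flagged = sorted(g for g, r in rates.items() if r / best < threshold)
    return rates, flagged

# Hypothetical population: 6,000 users in group "a", 4,000 in group "b".
population = [(i, "a") for i in range(6000)]
population += [(i, "b") for i in range(6000, 10000)]

# The campaign targets 1,000 users, all of whom are in group "a".
targeted = population[:1000]

rates, flagged = audit(population, targeted)
print(rates)    # selection rate per group
print(flagged)  # groups selected far less often than the best-served one
```

The audit never looks inside the targeting algorithm. It only inspects who was selected, which is why it can surface a skew even when nothing in the algorithm mentions race.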

Why stop here? In the future we may have a scenario where some people want to introduce automation even into jury and judges’ decisions. Just imagine the result if historical data is used by such a system or algorithm to make its decisions without any tweaking or conditions. We may end up with a scenario where this efficient computing judiciary is as biased as, if not more biased than, its human counterparts. Crime prediction and sentencing thus have the potential to be socially divisive issues. The flip side of racial bias in data is that one can also use big data to point towards systematic bias in the system. To sum up the argument, one can say that while algorithms and data may not be racist, the systems that produce the data can be biased, and algorithms fed with that data will reproduce the bias.

NYC MTA Taxi: Some Observations (Part I)

The release of the NYC MTA Taxi data, covering all taxi rides in 2013 and obtained through the Freedom of Information Act by Chris Whong, has been making the rounds on the internet. Here is my first take on the data. The following analysis is based on 173 million transactions. The total amount of money spent is $2,561,345,362, or roughly 2.6 billion dollars. The data can be divided into two main parts, transaction data and trip data. For the first set of analyses we will focus on the transaction data. Here is a high-level summary of the transaction data for the taxis.

Payment Type              Credit Card    Cash          DIS        NOC        Unknown
Number of Transactions    93,334,004     79,110,096    127,309    401,483    206,867
(DIS = dispute, NOC = no charge)
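As a quick check on the table, the shares of each payment type can be computed directly from the counts. This is just arithmetic on the numbers above; the variable names are my own.

```python
# Transaction counts by payment type, copied from the table above.
transactions = {
    "Credit Card": 93_334_004,
    "Cash": 79_110_096,
    "DIS": 127_309,
    "NOC": 401_483,
    "Unknown": 206_867,
}

total = sum(transactions.values())
shares = {kind: count / total for kind, count in transactions.items()}

print(f"total transactions: {total:,}")
for kind, share in shares.items():
    print(f"{kind:<12} {share:6.2%}")
```

The counts add up to 173,179,759 transactions, with credit cards accounting for a little under 54% and cash for most of the rest.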

From the table it is clear that the majority of transactions are credit card based. Now let's look at the breakdown of fare, surcharge, tips, etc. (Figure 1). The distribution of the charges for MTA Tax, Surcharge and Toll Amount appears to be more or less the same across payment types. The differences, however, show up in the other charges. The interesting thing is that for payment types other than credit card the tip is almost non-existent. Since it is unlikely that people do not tip when they pay cash for a taxi trip, there are two possibilities: either the tip amount is already included in the reported amount, or the tips are scarcely being reported.

[Figure 1: Breakdown of fare, surcharge, tip and other charges by payment type]

We can get a better idea about under-reporting or over-reporting if we look at the average transaction cost. We can glean this information by looking at the average of each type of charge for the various forms of payment (Figure 2). The main thing noticeable here is that the tip amount is minuscule for everything except the credit card and unknown cases. On average the tip amount is approximately $2.50 for these two categories. If we make the reasonable assumption that the tip amount is being under-reported in the other cases and is in fact also $2.50 per transaction, then we get the astonishing figure of $199,097,220 as the amount of tips that go unreported in the taxi data. This brings the adjusted total amount of money spent to $2,760,442,582, or 2.76 billion dollars. I think even this number is an underestimate, which becomes clear if we compare these numbers to aggregate stats for tips from previous years.

[Figure 2: Average amount of each type of charge by payment type]
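The tip adjustment above can be reproduced in a few lines. The sketch assumes, as the post does, a flat $2.50 tip on every cash, DIS and NOC transaction. Applying it to those three payment types is what yields the $199,097,220 figure.

```python
# Counts for the payment types where tips are almost never recorded.
untipped_transactions = {
    "Cash": 79_110_096,
    "DIS": 127_309,
    "NOC": 401_483,
}

ASSUMED_TIP = 2.50              # dollars per transaction (the post's assumption)
REPORTED_TOTAL = 2_561_345_362  # dollars, total from the raw data

untipped = sum(untipped_transactions.values())
missing_tips = untipped * ASSUMED_TIP
adjusted_total = REPORTED_TOTAL + missing_tips

print(f"transactions without tips: {untipped:,}")
print(f"estimated missing tips:    ${missing_tips:,.0f}")
print(f"adjusted total:            ${adjusted_total:,.0f}")
```

This prints 79,638,888 transactions, $199,097,220 in estimated missing tips and an adjusted total of $2,760,442,582, matching the numbers quoted above.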

Since we are mainly interested in the general patterns, the rest of the summary will be based on the unadjusted raw data for tips unless otherwise noted. Now let us look at the spending patterns over the days of the week, summed over the whole year (Figure 3). The main trend is that the amount of money spent goes up as the week progresses, peaking on Friday and then declining again. No real surprises here, although I was expecting Saturday to be a little higher since people do travel more often on the weekend. It would be interesting to compare the volume of taxi rides to the volume of subway rides in NYC to get a fuller picture of the transportation behavior of New Yorkers.

[Figure 3: Total money spent by day of the week, summed over the year]

The next step is of course to ask how the spending patterns vary over the course of the day (Figure 4). Activity is much lower after midnight and in the early hours of the day. The pace picks up after 5 am, with a notable dip around 4 pm, which is right before rush hour in the city.

[Figure 4: Total money spent by hour of the day]

No surprise here. Now let us look at the total money spent for each month (Figure 5). Nothing exciting here either, except the dip that we see in summer, where the total declines from $230 million at the beginning of summer in May to $190 million in August, a decline of $40 million over the course of summer, before picking up pace again. It is most likely the case that New Yorkers take taxis less often in summer.

[Figure 5: Total money spent by month]

Last but not least, we can also look at the money spent over the course of the year broken down by week (Figure 6). One thing is clear: spending goes down close to major holidays (New Year, Memorial Day, Independence Day, Thanksgiving) and dramatically during the Christmas holidays. This is not at all surprising, but we do see a few unexpected things. There is a spike in spending when daylight saving time starts, as well as a sharp decline right before Spring Break in NYC. Another unexpected thing is the sharp decline coinciding with the end of the Islamic month of Ramadan, just a few days before the beginning of the Islamic festival of Eid. Not coincidentally, NYC is also the city in the US with the largest number of mosques. Could this be an indirect indicator of the presence of a large Muslim population in the city? Additional analysis and data may reveal more insights into this phenomenon.

[Figure 6: Total money spent by week of the year]
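For readers who want to reproduce the breakdowns by weekday, hour, month and week, here is a minimal sketch of the aggregation using only the Python standard library. The record layout (a pickup time and a total amount) and the sample rows are hypothetical stand-ins, not the actual data or the code used for the figures.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Grouping keys matching the breakdowns discussed in the post.
KEYS = {
    "weekday": lambda t: WEEKDAYS[t.weekday()],
    "hour": lambda t: t.hour,
    "month": lambda t: t.month,
    "week": lambda t: t.isocalendar()[1],  # ISO week number
}

def total_by(records, key):
    """Sum the amount over records, grouped by key(pickup_time)."""
    totals = defaultdict(float)
    for pickup_time, amount in records:
        totals[key(pickup_time)] += amount
    return dict(totals)

# Hypothetical sample records: (pickup time, total amount in dollars).
records = [
    (datetime(2013, 5, 3, 18, 30), 14.50),  # a Friday evening
    (datetime(2013, 5, 3, 23, 10), 22.00),  # the same Friday, late night
    (datetime(2013, 5, 4, 2, 15), 9.75),    # early Saturday morning
    (datetime(2013, 8, 12, 8, 5), 11.25),   # a Monday in August
]

for name, key in KEYS.items():
    print(name, total_by(records, key))
```

With the full 173 million rows one would stream the file rather than hold a list in memory, but the grouping logic stays the same.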

In the next post I shall analyze the trip data and see how all of this fits together.