Social Media is the Best Sensor Network of Your City.


Through their use of Social Networks, people have become one of the most interesting and reliable sensor networks any city has.

Smart City technology in the last few years has mainly concentrated on the issues of networking, security, sensors and sharing open data. However, a very rich source of information about the city has developed in parallel (for free!) and has not yet been integrated in a meaningful way into the Smart City paradigm: people’s opinions, measured in real time through Social Networks. In this paper we develop the idea that these networks can be considered a form of rich sensor network, sending us in real time a sample of people’s opinions.

Over the last four years we have developed this approach, working with several Spanish city councils, including those of Barcelona and Tarragona. One of the main lessons we have learned is that considerable technological effort is needed to develop a stack of processes and algorithms using Big Data approaches (with algorithms for Machine Learning, Natural Language Processing and Search), and furthermore that this technological effort is necessary but not sufficient: a high level of human intervention is also required before meaningful data can be derived from such opinion networks.

Technological Challenges

Emotions, needs, desires and fears are being pushed to the Internet in a continuous stream. However, harnessing this information is not easy, and the technology is still in its infancy.

A full stack of technology is needed to capture, filter, process, analyse and serve social media data. It needs to be shaped into a coherent set of summaries and indicators, which in turn enable a set of APIs and applications, from real time alert systems to periodic reporting.

Considerable technological effort is necessary at every level of the required analytics stack:

  • Capture: support automatic access to, and querying of, the myriad Social Network sites and APIs.
  • Filtering: disambiguate the name of the city and of its main points of interest (e.g. there are more than 20 Barcelonas in the world). Disambiguate conversation about the city from conversation about its soccer team and from mentions in people’s addresses and job listings, which combined make up to 70% of the volume of captured mentions. Detect bots and other artificial methods of generating “false” conversations.
  • Categorisation: separate conversation about tourism, health, culture, security and so on (e.g. we use over 50 different topics at three different levels to analyse conversation in the city of Barcelona).
  • Summarisation: understanding the main trends for a given topic is crucial; we cannot descend to the level of the individual clipping, since there are literally millions per month.
  • Key Performance Indicators: besides summarisation, we need to derive numerical indicators.
  • Big (Natural Language) Data: almost all the processes mentioned above require sifting through millions of data points on demand, while receiving thousands of new data points per second. Worst of all, the main data points are free-text mentions, which are unstructured.
  • Visualisation: Providing powerful and intuitive dashboards to read summaries and indicators and drill down to specific mentions when necessary.
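To make the Filtering step concrete, here is a minimal sketch of rule-based disambiguation for city mentions. The keyword lists, threshold logic and example mentions are all illustrative assumptions, not the rules used in the actual production system (which relies on Machine Learning and NLP, as described above):

```python
# Toy sketch of the "Filtering" step: keep a mention of "Barcelona" only
# if its context looks city-related rather than soccer-related.
# Keyword lists below are invented for illustration.

CITY_CONTEXT = {"ajuntament", "eixample", "metro", "raval", "turisme"}
NOISE_CONTEXT = {"fc", "messi", "match", "vs", "goal"}  # soccer-team noise

def is_city_mention(text: str) -> bool:
    """Heuristic filter: require a city-context word, reject soccer noise."""
    tokens = set(text.lower().split())
    if "barcelona" not in tokens:
        return False
    if tokens & NOISE_CONTEXT:
        return False                      # likely about the soccer team
    return bool(tokens & CITY_CONTEXT)    # require some city-context signal

mentions = [
    "Barcelona vs Madrid, what a match",
    "New metro line opens in Barcelona next month",
]
kept = [m for m in mentions if is_city_mention(m)]
```

In practice this keyword approach is only a first pass; the text above notes that name disambiguation alone removes up to 70% of captured volume, which is why statistical classifiers are needed on top.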

Beyond the technological problems, for this data to gain acceptance it is very important to:

  • educate decision makers and citizens in general about the nature of this new “sensor” and the analytics that can be derived from it,
  • establish clear rules for the data collection process: its limits, transparency, limitations, etc.,
  • establish standards and procedures for capturing and sharing analytics derived from social media data,
  • establish defensive systems capable of detecting and countering attempts (e.g. by bots) to manipulate the conversation.


Social Network data, once properly filtered and analysed, can be exploited by cities in numerous ways. We will cite some of the areas in which we have worked over the last four years with the city council of Barcelona:

  • City Tourism: we can analyse the demographics of tourists and the comments made by tourists of different nationalities about different points of interest (PoIs) in a city. We can rank PoIs by their impact on Social Media, track the evolution of lesser-known PoIs being promoted by the City Council, etc.
  • City Security and Crisis Response: using statistical models, we can trigger alerts in real time based on unexpected Social Network activity. Depending on the topic of the alert, we can channel it directly to the corresponding department (e.g. health vs. security vs. maintenance).
  • Polling: sampling people’s concerns and interests before designing questions for a poll makes polls more relevant and cost-effective.
  • Branding: we can analyse how large events (trade shows, sports events, etc.) are connected to the city’s brand.
  • Benchmarking: we can draw a comparative analysis of the importance (for the population) of issues in one city with respect to a second, similar city.
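The real-time alerting idea mentioned under City Security can be sketched very simply: flag a topic when its mention volume deviates strongly from its recent history. The window size, threshold and counts below are illustrative assumptions, not the statistical models actually deployed:

```python
# Sketch of a volume-anomaly alert: raise an alert when the current
# hourly mention count for a topic is far above its historical mean.
from statistics import mean, stdev

def is_alert(history, current, threshold=3.0):
    """Alert when `current` is more than `threshold` standard deviations
    above the historical mean of mention counts for this topic."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold

# Hypothetical hourly mention counts for a "security" topic:
history = [12, 9, 14, 11, 10, 13, 12, 11]
normal = is_alert(history, 15)   # ordinary fluctuation
spike = is_alert(history, 60)    # unexpected spike, route to department
```

A production system would also need per-topic seasonality (day/night, weekday/weekend) and the bot filtering described earlier, or the alert channel drowns in false positives.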

Article written for presentation at the 2016 Smart City Expo.



Mind Funnels Vs. Mind Webs


Mind Funnels

Communication, opinion and conversation are becoming valuable raw materials, like iron or coal in the past. Those with the best ability to access and manipulate this new material will have a tremendous competitive advantage over the rest. Currently, these materials are consolidating in the hands of a few companies.

Imagine for a moment that someone could put a chip into every human being, translating their conscious thoughts, feelings and emotions into short text messages. Imagine that these text messages are transmitted to a central server where they are stored and analyzed.

Short text messages summarizing everyone’s interests, activities, desires, fears, opinions, plans and interactions, collected for every man and woman of every age and every country, in real time. Let’s call this a Mind Funnel, since it would funnel the content of everyone’s minds into a centralized server.

In economic terms, how much would this Mind Funnel be worth?

Who could compete against Mind Funnel Inc. in any sector once Big Data analytics and prediction technology is put to work?

Think of brands in retail, firms in healthcare or logistics, think of the public sector, security, politics, media… Think of the Marketing industry, for example, and its multiplier effect on all other industries: Mind Funnel Inc. could wipe out its competitors in virtually any sector. Whatever advantages you can imagine will be only the tip of the iceberg… Mind Funnel Inc. is set up for world domination.

Now consider applications such as Twitter, Facebook or WhatsApp. These applications are already digitizing a large fraction of our thoughts, emotions and needs in real time.

Twitter enables public communication (by all and for all). It is a Mind Funnel in the sense that a central Twitter-owned repository collects all of our tweets, where they can be analyzed and processed for information in any way Twitter sees fit. Similarly, Facebook and WhatsApp have enabled private communication, and in doing so they have acquired powerful Mind Funnels for themselves. Facebook, for example, knows more about the world’s companies and governments than those companies and governments themselves. It can use this knowledge to leverage all kinds of competitive advantages, as it is currently doing in marketing.

Many owners of such Mind Funnels allow third parties to “listen in”: to connect to their servers and query for information in various ways. This has already spurred great innovation and business growth; an entire ecosystem of services has evolved around Social Media. To cite one recent example, Apple purchased Topsy (a company specialized in Twitter analytics) for over $200M in December of last year.

But the owners of Mind Funnels are very careful to keep their core value locked away (and of course, why would they do otherwise?). To cite another recent example, WhatsApp was bought by Facebook this month for $19,000M, 100 times the price of Topsy: the difference between owning the users and their data vs. merely having the ability to analyze it.

In other words: communication, opinion and conversation are becoming valuable raw materials, like iron or coal in the past. Those with the best ability to access and manipulate this new material will have a tremendous competitive advantage over the rest. Currently, these materials are consolidating in the hands of a few companies.

So far, banks and credit card companies, hospitals, insurers, retailers and many others have collected and commercially exploited our information in different ways. Are Mind Funnels different? Should we care differently?

Perhaps we simply need new legislation and industry regulations. Or perhaps this new material will prove simply too sensitive and too valuable, and we will need public services to collect, redistribute and monetize it: to monetize our own thoughts and feelings.

Let’s follow this thought for one second. Imagine building a European Mind-Web. As in the railroad system, once this network is built, different goods (or applications) can pass through it, for a price.

Future social networks and other communication applications (the future Pinterests, Vimeos, FourSquares, WhatsApps, TripAdvisors, etc.) could use this network to quickly deploy new products at a fraction of the cost, without having to deal with infrastructure or legal issues. Data consumers (exploiting novel Big Data analytics and prediction technology) could tap directly into the network. And end users could feel safe in the knowledge that their private communications, thoughts and opinions are fueling a business ecosystem in a regulated and respectful way. Appropriate privacy and monetization policies could be built into the system to enforce a level playing field for everyone dealing with this type of data. And, why not, perhaps this network would be obliged to return to its users part of the wealth they generate.

Seen from this angle, an Open Source Mind-Web is a necessity: it would enable us to regulate the use of, and the wealth derived from, the network, instead of watching a few reap all the benefits and build ever-higher barriers to entry around us. In my opinion, such a network could bring huge potential for European growth and innovation.

Hugo Zaragoza
(Text written for a Horizon 2020 European experts meeting, February 2014).

Dreaming is Important to Generalize Efficiently

(Image © chiaink)


Imagine you have a very simple “knowledge memory” that stores knowledge as an associative array (or map) of “key => value” pairs. This memory supports the operators:
* get: retrieve a “value” for a given “key”.
* keyIterator: iterate over all the keys present in the memory.
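This interface is small enough to sketch directly. The class and method names below are just one possible rendering of the two operators described above:

```python
# A minimal rendering of the "knowledge memory": an associative array
# exposing only `get` and a key iterator (plus `put` to fill it).
class KnowledgeMemory:
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        """Retrieve the value stored under `key`."""
        return self._store[key]

    def key_iterator(self):
        """Iterate over all keys present in the memory."""
        yield from self._store

mem = KnowledgeMemory()
mem.put("Alexandra", 72)
mem.put("Mia", 4)
ages = [mem.get(k) for k in mem.key_iterator()]
```

Note that the iterator gives no control over *which* keys come back or in what order; that limitation is exactly what the rest of this post is about.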

If we want to learn new things from this data (i.e. to generalize) we need to inspect keys and values and try to find interesting correlations, verify certain hypotheses, etc.

Example: we can store many people’s names and their ages (name => age), and draw conclusions (or at least hypotheses) such as that long names belong to older people, or that names starting with “A” have fallen out of fashion in the last 10 years, or whatever.

In order to do this we need to choose certain keys, check their values, come up with models or hypotheses, then get more keys and values, etc.

Doing this with an iterator seems awfully wasteful: we need to iterate over and over, disregarding most of the keys, just to reach the keys that seem interesting for the hypothesis currently being verified.

It seems that in order to learn from associative arrays we need first a good key sampler, one that can be biased in some configurable ways so that it yields interesting keys with high probability. What is interesting will depend on the aspect we are trying to learn at a given moment, the hypothesis we are trying to check, etc.

This sampler strikes me as similar to the human process of dreaming. By this I mean that we browse (i.e. sample) the space of possible events of interest, wandering randomly back and forth before jumping to a new area and wandering some more. As we do this we keep retrieving values, checking how the stored memories behave at each location… Of course dreaming goes beyond this, but it seems like an interesting, if crude, model.

In the previous example, we would need to “dream up” names, initially at random, check their ages, find some interesting hypothesis, and then dream up some more names in the vicinity of the hypothesis or correlation to be checked.
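A two-phase version of this can be sketched as follows. The toy name/age data and the bias function (favouring long names, to probe the "long names belong to older people" hypothesis) are invented for illustration:

```python
# Sketch of a biased key sampler ("dreaming"): first wander uniformly,
# then bias sampling toward keys relevant to the current hypothesis.
import random

random.seed(0)
memory = {"Jo": 8, "Alexandrina": 80, "Maximiliano": 75, "Li": 12,
          "Bartholomew": 68, "Sam": 15, "Konstantina": 71, "Ed": 9}
keys = list(memory)

def dream(bias=None, n=4):
    """Sample `n` keys. With no bias, sample uniformly; otherwise weight
    each key by how interesting it is for the current hypothesis."""
    if bias is None:
        return random.sample(keys, n)
    weights = [bias(k) for k in keys]
    return random.choices(keys, weights=weights, k=n)

# Phase 1: wander at random, notice long names tend to map to high ages.
explore = dream()
# Phase 2: dream up mostly long names to check that correlation.
confirm = dream(bias=lambda k: len(k) ** 2)
confirm_ages = [memory[k] for k in confirm]
```

The bias function is the configurable part: it encodes "what is interesting at this moment", and would change as new hypotheses come up.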

Note that without this ability to browse the space of keys in an intelligent way, it seems hard to think of an even remotely efficient learning algorithm…

Associative Memories

If we replace the associative array by an associative memory, some things get better, but I think we still require dreaming. Without explicitly defining an associative memory, note that many instances of associative memories have no way to iterate over keys:
* a self-organizing (Kohonen) map
* a traditional map in which keys are fancy hashes of an original key: hashes that are locality-preserving in some clever way but cannot be reversed.
* some human memory capacities also seem to lack a way to iterate over keys (e.g. try listing all words that you know in a given language).

In this case learning and generalization happen automatically by simply adding new patterns, so no “dreaming” is necessary.

However, intuitively it seems that dreaming is still necessary here to re-learn the representation itself:
* prune and derive new features in a Kohonen map
* add new hashes, modify the hashing
* “make things click” in a human memory :)

A Thousand Exponentially Smaller Hyper-Sausages: a Toy Model for Learning Certain Complex Problems


(WARNING: this is a technical text about Machine Learning, not meant for the general public.)

I am presenting a problem here for which I don’t have a reasonable solution; if you do, please comment. I have found this problem in several forms when applying machine learning in real web search and NLP applications, but only recently was I able to formulate it clearly as a toy data model. I hope this simplification can draw the attention of more theoretically inclined machine learning people who can help…

A Toy Data Model

First consider the following data model:

  • the input data distribution is a mixture of very many components (e.g. Gaussians) with relative sizes decreasing very fast.
  • the output data is some non-linear function of the input (a regression problem), smooth within each component but using a different feature subset in each component.

In other words, we are tackling a single regression problem in a single input space, but with a rich structure: many loosely coupled sub-problems with very uneven relative sizes. We can picture this as:

  • a thousand hyper-sausages, each exponentially smaller than the next…
  • within each hyper-sausage the function to be learnt is nice and smooth and uses a small subset of features (i.e. dimensions), but this subset differs from component to component
  • (and of course we don’t know a priori how many sausages there are, or their shape or position, and they intersect)
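A generator for this toy data model makes the structure explicit. All constants here (decay rate, dimension, component centres, the sinusoidal target, noise level) are illustrative choices, not part of the problem statement:

```python
# Toy data model: mixture components whose sizes decay geometrically,
# each with its own small, component-specific set of active features.
import math
import random

random.seed(1)
DIM, N_COMPONENTS, DECAY = 10, 5, 0.1   # component k has weight DECAY**k

def sample_point():
    # Pick a component with exponentially decreasing probability.
    weights = [DECAY ** k for k in range(N_COMPONENTS)]
    k = random.choices(range(N_COMPONENTS), weights=weights)[0]
    # Gaussian blob centred at (10k, ..., 10k): the k-th "hyper-sausage".
    x = [10 * k + random.gauss(0, 1) for _ in range(DIM)]
    # The target uses a small feature subset that differs per component.
    active = [k % DIM, (k + 1) % DIM]
    y = sum(math.sin(x[i]) for i in active) + random.gauss(0, 0.01)
    return k, x, y

data = [sample_point() for _ in range(1000)]
counts = [sum(1 for k, _, _ in data if k == c) for c in range(N_COMPONENTS)]
```

With `DECAY = 0.1`, roughly 90% of the points land in component 0, 9% in component 1, and so on, which sets up the error-decomposition problem discussed next.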

Example: in Web Search Ranking, over 50% of queries are simple navigational queries; for these you basically need features of in-link-ness and URL match-ness. But then there is a smaller but important number of informational queries, for which you need fielded matching. A small fraction of these are proper-name queries, for which you really need location or n-grams. A smaller fraction still are proper-name queries with an ambiguous surname matching a geographical location… etc.

The Problem

Now consider the error of an approximation to this regression problem. This error can be decomposed linearly in terms of the components. But because the smaller components are so much smaller than the larger ones, their effect on the overall error will be negligible, even with a large training set. Larger and larger amounts of labelled data will marginally improve overall performance on the largest component, but nothing will be learnt about the smaller ones.

All your “learning bandwidth” is spent improving the precision of the model on the first component, where little new can be learnt. Intuitively, you need to sample in some stratified manner to obtain data about the smaller components. However, you face several practical problems:

  • you’ll need to discover the stratification scheme as you go,
  • you’ll need to deal with the fact you will no longer have an i.i.d. sample,
  • you’ll need to re-define your cost function to promote learning on small components in some reasonable way…
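A quick back-of-the-envelope computation shows why i.i.d. sampling starves the small components. The decay rate and component count below match the toy model's spirit but are otherwise arbitrary:

```python
# Expected number of i.i.d. labelled examples per component when
# component weights decay geometrically (here by a factor of 10).
DECAY, N_COMPONENTS = 0.1, 5
weights = [DECAY ** k for k in range(N_COMPONENTS)]
probs = [w / sum(weights) for w in weights]

def expected_counts(n):
    """Expected training examples per component for a sample of size n."""
    return [p * n for p in probs]

# Even with 100,000 labelled points, component 3 gets ~90 examples and
# component 4 only ~9: almost nothing can be learnt about the tail.
counts = expected_counts(100_000)
```

This is the quantitative core of the problem: labelling effort scales linearly, but the tail components shrink geometrically, so no affordable i.i.d. sample ever covers them.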

An alternative is perhaps to use an exponentially increasing loss, so that larger mistakes dominate the learning no matter how rare they are; however, this is also tricky in practice for very many reasons, such as convergence rate, outliers…

Why does this matter?

When you are in this type of situation, and you don’t tackle it head-on, you become entangled in a web of confusion, partial solutions and endless discussions.

First, it kills creativity: whatever you do to improve an interesting sub-problem (new interesting features, new cost functions, whatever) will never improve things “on average” (in fact, it will worsen things for the first component).

Second, it leads to a hundred ad-hoc solutions full of dragons. Each engineer and researcher will go in a different direction, making all results incomparable and all discussions extremely long:

  • adding handpicked “difficult cases” to their training and testing samples,
  • over-sampling known problem cases (calling this bootstrapping, active learning, triage…),
  • inventing special loss functions (error metrics) that weight different things differently,
  • adding a battery of “validation” sets with different characteristics and choosing models that behave reasonably well on them.

Worse, it becomes a political problem. What is the relative importance of improving 10% on a small sub-problem vs. improving 0.05% on average for all? Endless, meaningless philosophical discussions follow, which in turn spawn endless projects about sampling, validation, metrics…



