(WARNING: this is a technical text about Machine Learning, not meant for the general public.)
I am presenting a problem here, for which I don’t have a reasonable solution; if you do, please comment. I have found this problem in several forms when applying machine learning in real web search and NLP applications, but only recently I was able to formulate it clearly as a toy data model. I hope this simplification can draw the attention of more theoretically inclined machine learning people who can help…
A Toy Data Model
First consider the following data model:
- the input data distribution is a mixture of very many components (e.g. gaussians) with relative sizes decreasing very fast.
- the output data is some non-linear function of the input data distribution (a regression problem), smooth within components but using different feature subsets in each components.
In other words we are tackling a single regression problem in a single input space, but with a rich structure: many loosely coupled sub-problems, with very uneven relative sizes. We can picture this as:
- a thousand hyper-sausages, each exponentially smaller than the next, exponentially smaller than the next…
- within each hyper-sausage the function to be learnt is nice and smooth and uses a small subset of features (i.e. dimensions), but these differ from component to component
- (and of course we don’t know a priori how many sausages there are, or their shape or position, and these intersect)
Example: In Web Search Ranking learning, over 50% of queries are simple navigational, for these you basically need features of in-link-ness and url match-ness. But then there are a smaller but important number of informational queries, for which you need fielded matching. A small fraction of these are proper name queries, for which you really need location or n-grams. A small fraction of these are proper name queries with an ambiguous surname matching a geographical location… etc.
Now consider the error of an approximation to this regression problem. This cost can be decomposed linearly in terms of the components. But because smaller components are much smaller than larger ones, their effect on overall error would be negligible, even with a large training set. Larger and larger amounts of labelled data will marginally improve the overall performance on the largest component, but nothing will be learnt about the smaller ones.
All your “learning bandwith” is spent improving the precision of the model on the first component, where little new can be learnt. Intuitively you need to sample in some stratified manner, to obtain data about the smaller components. However you have several practical problems:
- you’ll need to discover the stratification scheme as you go,
- you’ll need to deal with the fact you will no longer have an i.i.d. sample,
- you’ll need to re-define your cost function to promote learning on small components in some reasonable way…
An alternative is perhaps using an exponentially increasing loss, so larger mistakes dominate the learning no matter how rare; however… but this is also tricky for very many reasons in practice, such as convergence rate, outliers…
Why this matters?
When you are in this type of situation, and you don’t tackle it head-on, you become entangled in a web of confusion, partial solutions and endless discussions.
First it kills creativity: whatever you do to improve over an interesting sub-problem (new interesting features, new cost-functions, whatever) will never improve things “on average” (in fact, will worsen things for the first component).
Second, it leads to a hundred ad-hoc solutions full of dragons. Each engineer and researcher will go in a different direction, making all results incomparable and all discussions extremely long:
- adding handpicked “difficult cases” to their training and testing samples,
- over-sampling known problem cases (calling this bootstrapping, active learning, triage…)
- inventing special loss functions (error metrics) that weight different thing differently
- adding a battery of “validation” sets with different characteristics and choose models that behave reasonably well on them.
Worse, becomes a political problem. What is the relative importance of improving 10% on a small subproblem vs. improving 0.05% on average for all? Endless meaningless philosophical discussions follow, which in turn will spawn endless projects about sampling, validation, metrics…
[ Some Related Work:
- Web Search Solved? All Result Rankings the Same?
H. Zaragoza and B. B. Cambazoglu and R. Baeza-Yates, CIKM 2010.
- Incremental Sampling of Query Logs
R. Baeza-Yates, SIGIR’15.