Need to Calculate “Lexical Distance” Among Languages of India

I came across this very interesting network visualization of the languages of Europe based on “lexical distance” (details in the original post). Basically, languages that are more similar to one another cluster together, and the different clusters tend to be quite dissimilar from each other.

It occurred to me that if a similar exercise were done on the languages used in India, the results could be quite insightful. For one, it could tell us that the idea of using English as a bridge language is neither a natural nor an ideal solution to managing our linguistic plurality. Second, these clusters of proximate (i.e. more similar) languages could offer solutions for second- and third-language instruction in different states.

The possibilities are endless. I would appreciate your help in spreading the idea, so that some computationally inclined linguistics researcher can do this and related analyses for India’s languages.


Lexical Distance Network Among the Major Languages of Europe


This chart shows the lexical distance — that is, the degree of overall vocabulary divergence — among the major languages of Europe.
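To make the idea concrete, here is a minimal sketch of one way such a distance could be computed. It is not the method behind the chart: it assumes that lexical distance can be approximated by the average normalized edit (Levenshtein) distance between words for the same meanings. The tiny word list (“water”, “hand”, “name”, “two”) uses rough romanizations chosen only for illustration, not vetted linguistic data, and the names `levenshtein`, `lexical_distance`, and `WORDS` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lexical_distance(words_a: list[str], words_b: list[str]) -> float:
    """Average normalized edit distance over aligned, non-empty word lists.

    Returns a value from 0.0 (identical lists) to 1.0 (nothing in common).
    """
    if not words_a or len(words_a) != len(words_b):
        raise ValueError("word lists must be non-empty and of equal length")
    total = 0.0
    for wa, wb in zip(words_a, words_b):
        total += levenshtein(wa, wb) / max(len(wa), len(wb))
    return total / len(words_a)


# Illustrative only: rough romanizations for "water", "hand", "name", "two".
WORDS = {
    "Hindi": ["pani", "hath", "nam", "do"],
    "Marathi": ["pani", "hat", "nav", "don"],
    "Tamil": ["tannir", "kai", "peyar", "irandu"],
}

# Pairwise distances; these could serve as edge weights in a network chart.
distances = {
    (a, b): lexical_distance(WORDS[a], WORDS[b])
    for a, b in combinations(WORDS, 2)
}

for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a:8s} {b:8s} {d:.2f}")
```

A real study would need a much longer standardized meaning list, a consistent phonetic transcription across scripts, and probably a more careful treatment of cognates and loanwords than raw edit distance provides.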

The size of each circle represents the number of speakers for that language. Circles of the same color belong to the same language group. All the groups except for Finno-Ugric (in yellow) are in turn members of the Indo-European language family.

English is a member of the Germanic group (blue) within the Indo-European family. But thanks to 1066, William of Normandy, and all that, about 75% of the modern English vocabulary comes from French and Latin (i.e., the Romance languages, in orange) rather than Germanic sources. As a result, English (a Germanic language) and French (a Romance language) are actually closer to each other in lexical terms than Romanian (a Romance language) and French.

So why is English still considered a Germanic language? Two reasons. First, the most frequently used…

Author: harshT

Assistant Professor

