About

Humans are often described as social animals, and communicating our thoughts plays a critical role in a healthy life. Spoken language is a skill that humans have evolved to use. In this post I intend to touch upon a few questions about it. Feel free to use the comments option below to suggest (or recommend) any resources to dig further.

I read the article Speech Acoustics of the World’s Languages by Tucker and Wright, in the Acoustics Today magazine. By the way, this is a freely available magazine, often with nice articles. Coming back to the article, the key highlights included:

  • Currently, there are more than 7000 spoken languages.
  • Acoustic signal processing research published in international journals has focused mostly on English.
  • A lot remains unquestioned (and hence also unanswered) in the scientific understanding of speech production and perception for most of these languages.
  • The same holds for technology applications, such as automatic speech recognition and text-to-speech conversion. Most technology is designed for English.

A popular speech production model is the source-filter model. It assumes that the sound originates at a source (e.g. the vibration of the vocal folds, also called vocal cords, which lie horizontally in the larynx). As the resulting pressure wave travels through the vocal tract, the sound gets modified, or filtered. Applying the source-filter model to different speech sounds gives insight into the physical configurations of the source and filter used to produce them. These insights have helped in understanding aspects such as pitch (or the fundamental frequency) and the formant frequencies of a speaker.
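To make the source-filter idea a bit more concrete, here is a minimal sketch of it in code. It is a toy illustration under my own assumptions, not a model of any real speaker: the source is a plain impulse train at the fundamental frequency, the filter is a cascade of simple two-pole resonators (one per formant), and the formant frequencies and bandwidths are rough, illustrative values for an /a/-like vowel. The function names are made up for this sketch.

```python
import numpy as np

def glottal_source(f0, duration, fs):
    """Impulse train at f0 Hz: a crude stand-in for vocal fold pulses (assumption)."""
    n = int(duration * fs)
    source = np.zeros(n)
    period = int(round(fs / f0))  # samples between successive pulses
    source[::period] = 1.0
    return source

def resonator(x, freq, bandwidth, fs):
    """Two-pole resonator modelling a single vocal tract resonance (formant)."""
    r = np.exp(-np.pi * bandwidth / fs)   # pole radius, set by the bandwidth
    theta = 2 * np.pi * freq / fs         # pole angle, set by the centre frequency
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

def synthesize_vowel(f0=120.0,
                     formants=((700, 80), (1200, 90), (2600, 120)),
                     duration=0.5, fs=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = glottal_source(f0, duration, fs)
    for freq, bw in formants:  # illustrative formant values, not measured data
        signal = resonator(signal, freq, bw, fs)
    return signal / np.max(np.abs(signal))  # normalize peak amplitude to 1

vowel = synthesize_vowel()
```

Changing `f0` changes the perceived pitch while leaving the formants alone, and changing `formants` changes the vowel quality while leaving the pitch alone. That separation is the point of the model.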

The smallest unit of speech sound that distinguishes one word from another is referred to as a phoneme. Phonemes include vowels and consonants. Every spoken language has a phoneme set, and the phoneme sets of two languages can overlap. Does a language have more consonants or vowels? Let's see this by pulling and visualizing some data from PHOIBLE. You can scroll past the code below and jump to the plot.

# import some packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator
import seaborn as sns
sns.set() # start from seaborn's default theme
sns.set_style("ticks") # white background with axis ticks

############# load csv file (one row per language, with vowel and consonant counts)
df = pd.read_csv('./my_data/vowel_consonant_languages.csv')

############ plot data: consonant count vs vowel count, one point per language
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(df['count_consonant'],df['count_vowel'],color='black',alpha=.5)
ax.set_xlabel('CONSONANT COUNT',fontsize=13)
ax.set_ylabel('VOWEL COUNT',fontsize=13)
ax.grid(True)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.xaxis.set_minor_locator(AutoMinorLocator())
ax.yaxis.set_minor_locator(AutoMinorLocator())
ax.tick_params(which='both', width=2)
ax.tick_params(which='major', length=7, labelsize=13)
ax.tick_params(which='minor', length=4, color='gray')
# same range on both axes so the two counts can be compared directly
ax.set_xlim([0,80])
ax.set_ylim([0,80])

# distribution of vowel and consonant counts across languages
# (histplot replaces distplot, which is deprecated in recent seaborn releases)
fig, ax = plt.subplots(figsize=(10,5))
sns.histplot(df['count_vowel'],color='red',label='VOWEL',kde=True,stat='density',ax=ax)
sns.histplot(df['count_consonant'],color='blue',label='CONSONANT',kde=True,stat='density',ax=ax)
ax.grid(True)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_xlabel('COUNT ACROSS LANGUAGES',fontsize=13)
ax.set_ylabel('DENSITY',fontsize=13)
ax.legend(loc='upper right',frameon=False,fontsize=13)
ax.tick_params(labelsize=13)
plt.show()
print(len(df['count_consonant'])) # number of languages in the dataset
# prints: 2000

In the above plots, the left panel shows that all 2000 data points (each denoting a language) have a higher consonant count than vowel count. The right panel depicts a histogram of the vowel and consonant counts across the 2000 languages. The vowel count peaks at around 10 and the consonant count peaks at around 23. The consonant count also has more spread. Here is my first question.

Why are there more consonants than vowels in most spoken languages? Is there a theory to explain this? Does it have more to do with limitations on speech production or on perception abilities?
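Before looking for a theory, it helps to put a number on how often consonants actually outnumber vowels. The sketch below is one way to do that, assuming a data frame with the same `count_consonant` and `count_vowel` columns as the CSV used above. The helper name `consonant_vowel_summary` and the toy numbers are made up for illustration and do not correspond to real languages.

```python
import pandas as pd

def consonant_vowel_summary(df):
    """Summarize how consonant and vowel counts compare across languages."""
    more_consonants = (df["count_consonant"] > df["count_vowel"]).mean()
    return {
        "fraction_more_consonants": float(more_consonants),
        "median_vowels": float(df["count_vowel"].median()),
        "median_consonants": float(df["count_consonant"].median()),
    }

# Toy data (hypothetical values, only to show the call)
toy = pd.DataFrame({
    "count_consonant": [22, 18, 6, 40],
    "count_vowel": [5, 10, 8, 12],
})
summary = consonant_vowel_summary(toy)
print(summary)
```

Running the same function on the full data frame would give the fraction of languages with more consonants than vowels, which is a more precise statement than reading it off the scatter plot.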

Phonotactics refers to how phonemes are combined in a spoken language. Two languages can have the same phoneme set but differ significantly in their phonotactics. You cannot randomly combine phonemes and make an intelligible speech sound. Every spoken language would have gone through a stage of evolution to arrive at its own stable phonotactics. Here is my second question.

Given the phoneme sets of two languages, is it possible to meaningfully quantify the similarity between the languages? I think the answer also has to use the phonotactics. If yes, can we make a plot of how close Hindi is to the other 8000 spoken languages? Maybe Duolingo will know more about this.
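One simple starting point, offered as a sketch and not as an established measure of language similarity, is the Jaccard similarity between two sets: the size of their intersection divided by the size of their union. Applied to phoneme sets it ignores phonotactics, so the sketch also applies it to sets of adjacent phoneme pairs (bigrams) taken from word lists. The inventories and "words" below are toy examples invented for illustration, not real language data.

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity: |A intersect B| / |A union B|, between 0 and 1."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

def phoneme_bigrams(words):
    """Set of adjacent phoneme pairs seen in a list of phoneme sequences."""
    return {(w[i], w[i + 1]) for w in words for i in range(len(w) - 1)}

# Toy languages with the same phoneme set but different allowed sequences
inventory_x = {"p", "t", "k", "a", "i"}
inventory_y = {"p", "t", "k", "a", "i"}
words_x = [["p", "a", "t"], ["k", "i"]]
words_y = [["p", "a", "t"], ["i", "k"]]

inventory_sim = jaccard_similarity(inventory_x, inventory_y)
phonotactic_sim = jaccard_similarity(phoneme_bigrams(words_x),
                                     phoneme_bigrams(words_y))
print(inventory_sim, phonotactic_sim)
```

In this toy case the phoneme sets are identical, yet the bigram sets only partly overlap, which is the situation described above: same phonemes, different phonotactics.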

How are the origins of languages distributed on the Earth's surface? I tried visualizing this by pulling data from the interesting Glottolog dataset. Here's what I did:

  • Downloaded the languages+geo location CSV data
  • Installed the Basemap Python package to overlay data on maps
  • Wrote a few lines of code to make the visualization

Let's see the visualization of how 8125 languages are distributed on the surface of the Earth. You can scroll past the code and jump directly to the plot below.

from mpl_toolkits.basemap import Basemap

df = pd.read_csv('./my_data/languages_and_dialects_geo.csv')
# keep only entries at the 'language' level that have coordinates
df_new = df.dropna(subset=['latitude', 'longitude'])
df_new = df_new[df_new['level']=='language'].reset_index(drop=True)
# row positions of the languages in each macroarea
macroarea = df_new['macroarea'].dropna().unique()
indx = [df_new.index[df_new['macroarea']==area] for area in macroarea]

# Extract the data we're interested in
lat = df_new['latitude'].values
lon = df_new['longitude'].values

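fig = plt.figure(figsize=(16, 8))
m = Basemap() # default projection: equidistant cylindrical world map
m.drawcoastlines()

# one color per macroarea (the original always used clr[0], so every point was green)
clr = ['tab:green','blue','tab:red','magenta','black','darkred']
for i in range(len(macroarea)):
    m.scatter(lon[indx[i]], lat[indx[i]], latlon=True, alpha=0.2,
              color=clr[i % len(clr)], label=macroarea[i])
plt.legend(loc='lower left', frameon=False)
plt.show()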

In the above plot you can spot that:

  • Most languages are distributed close to the equator. This might be because these places were more densely populated 1500-2000 years ago.
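The "close to the equator" impression can be checked with a couple of lines. The sketch below assumes a data frame with a `latitude` column in degrees, as in the Glottolog CSV loaded above. The helper names, the 23.5 degree cut-off (roughly the tropics) and the toy latitudes are my own choices for illustration.

```python
import numpy as np
import pandas as pd

def fraction_within_latitude(df, max_abs_lat=23.5):
    """Fraction of rows whose latitude lies within +/- max_abs_lat degrees."""
    lat = df["latitude"].dropna()
    return float((lat.abs() <= max_abs_lat).mean())

def latitude_band_counts(df, band_width=10):
    """Number of rows in each latitude band of band_width degrees."""
    edges = np.arange(-90, 90 + band_width, band_width)
    bands = pd.cut(df["latitude"].dropna(), bins=edges, include_lowest=True)
    return bands.value_counts().sort_index()

# Toy latitudes (hypothetical), only to show the calls
toy = pd.DataFrame({"latitude": [0.0, 10.0, -5.0, 45.0, 60.0, None]})
tropical_fraction = fraction_within_latitude(toy)
band_counts = latitude_band_counts(toy)
print(tropical_fraction)
print(band_counts)
```

Raw counts per band do not account for how much land (or how many people) each band contains, so they only describe where the languages are, not why.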

Interestingly, in 2006 Papua New Guinea had 832 living languages, making it the most linguistically diverse place on Earth. My third question is ...

What necessitates the creation of a new language?

and my fourth question is ...

Can we cluster languages based on their acoustics? How will the clusters relate to the geographical closeness of their origins?

That's it as of now.

Extras

On the side of language perception by humans, I came across The Great Language Game, a game where a user is asked to identify the spoken language by listening to an audio snippet. This fun project went on for a few years, and the results are published here. The results are worth a look. I will add my summary after going through them.

  • To know more about language diversity on our Earth, you may find this paper useful.
  • Installing Basemap on macOS:
    brew install geos
    pip3 install https://github.com/matplotlib/basemap/archive/master.zip