TROPICAL GEOGRAPHY, 2018, Vol. 38, Issue (2): 255-263. doi: 10.13284/j.cnki.rddl.003007

A Model for Optimizing Chinese Addresses’ Geocoding Results from Multiple Map APIs Based on Clustering and Classifying

LIAO Weiwei¹, LIU Lin²,³, ZHOU Suhong¹, SONG Guangwen¹, LI Qiuping¹, LIU Kai¹

(1. School of Geography Science and Planning, Center of Integrated Geographic Information Analysis, Sun Yat-sen University, Guangzhou 510275, China; 2. Center of Geographic Information Analysis for Public Security, School of Geographic Sciences, Guangzhou University, Guangzhou 510006, China; 3. Department of Geography, University of Cincinnati, Cincinnati OH 45221-0131, USA)
Online: 2018-03-05 Published: 2018-03-05

Abstract: Online geocoding services transform text-based addresses into spatial positions on a map conveniently and efficiently. However, inaccuracies exist in all online geocoding services, and the extent of these inaccuracies differs among service providers. It is therefore necessary to filter and optimize geocoded results to improve accuracy. In this paper, we develop a multi-source integration model, based on a clustering algorithm, for optimizing online geocoding results of Chinese addresses. The model assesses inconsistencies among the results of several online geocoding providers and derives an optimal result. It can geocode large collections of Chinese addresses quickly through the application programming interfaces (APIs) of online geocoding services, including Amap, Baidu, and Tencent. First, data-cleaning rules are applied to examine whether each online geocoding result is credible. Then, the credible results are further refined by an optimization algorithm that combines clustering with a random forest. A training sample of 2000 theft addresses with known precise locations is clustered using a hierarchical clustering method. The addresses are divided into 8 groups through clustering and then used to train the random forest model, resulting in an accuracy of 95.36%. The trained model is then validated on a second sample, also containing 2000 theft addresses. Our experiments show the following: 1) for addresses with a mediocre level of standardization, the Amap geocoding service has the highest quality but still shows significant spatial inaccuracy; 2) the spatial confidence values and geographic levels returned by the geocoding APIs reflect the quality of geocoding; 3) the locational accuracy of the model is significantly higher than that of any of the three providers. For the training sample, the mean error distance of Amap is 590.43 m, and the model reduces it to 173.73 m, with 87.78% of the addresses geocoded. For the validation sample, the model reduces the mean error distance from 554.88 m to 180.04 m, with 90.08% of the abnormal geocoding results from Amap rejected. The accuracy and geocoding success rates of the two samples are similar; and 4) the model optimizes geocoding results of addresses in both urban and suburban areas with comparable accuracy, which suggests that it is widely applicable. In sum, the model converts massive collections of text-based Chinese addresses into spatial locations effectively and efficiently, and improves online geocoding accuracy through clustering and optimization.
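The two distance-based quantities used in the abstract, inconsistency among providers and mean error distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes great-circle (haversine) distance, which the abstract does not specify, and it assumes all coordinates have already been converted to one common coordinate system. The function names and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees.

    Assumption: haversine distance stands in for whatever distance
    measure the paper actually used.
    """
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def provider_spread_m(candidates):
    """Largest pairwise distance among the providers' results for one address.

    A hypothetical, very simple inconsistency measure: a large spread
    means the providers disagree about where the address is.
    """
    points = list(candidates.values())
    return max(haversine_m(p, q) for p, q in combinations(points, 2))


def mean_error_m(predicted, truth):
    """Mean error distance in metres between geocoded and known locations."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    return sum(haversine_m(p, t) for p, t in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    # Made-up coordinates for one address, for illustration only.
    candidates = {
        "amap": (23.1000, 113.3000),
        "baidu": (23.1005, 113.3002),
        "tencent": (23.1300, 113.3500),
    }
    print(f"provider spread: {provider_spread_m(candidates):.1f} m")
    print(f"mean error: {mean_error_m([candidates['amap']], [(23.1002, 113.3001)]):.1f} m")
```

In this sketch a large spread only flags an address as inconsistent. Choosing which provider's result to keep is the job of the clustering and random forest steps described in the abstract, which are not reproduced here.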

Key words: online geocoding service, data cleaning, hierarchical clustering method, random forest