Tropical Geography (热带地理) ›› 2018, Vol. 38 ›› Issue (2): 255-263. doi: 10.13284/j.cnki.rddl.003007

• Article

A Classification-Based Optimization Model for Multi-Source Online Geocoding Services

LIAO Weiwei1, LIU Lin2,3, ZHOU Suhong1, SONG Guangwen1, LI Qiuping1, LIU Kai1

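  1. (1. Center of Integrated Geographic Information Analysis, School of Geography Science and Planning, Sun Yat-sen University, Guangzhou 510275; 2. Center of Geographic Information Analysis for Public Security, School of Geographic Sciences, Guangzhou University, Guangzhou 510006; 3. Department of Geography, University of Cincinnati, Cincinnati OH45221-0131, USA)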
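  • Publication date: 2018-03-05 Release date: 2018-03-05
  • Corresponding author: LIU Lin (b. 1965), male, from Xiangtan, Hunan; Ph.D., professor and doctoral supervisor; research interests include spatial simulation of crime, multi-agent simulation, and GIS applications. (E-mail) liulin2@mail.sysu.edu.cn
  • About the author: LIAO Weiwei (b. 1993), female, from Shantou, Guangdong; master's degree; research interests include crime geography and spatio-temporal data mining. (E-mail) liaoww3@sina.com
  • Funding:
    Key Program of the National Natural Science Foundation of China (41531178); Excellent Young Scientists Fund of the National Natural Science Foundation of China (41522104); Research Team Project of the Natural Science Foundation of Guangdong Province (2014A030312010); Science and Technology Planning Project of Guangdong Province (2015A020217003)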

A Model for Optimizing Chinese Addresses’ Geocoding Results from Multiple Map APIs Based on Clustering and Classifying

LIAO Weiwei1, LIU Lin2,3, ZHOU Suhong1, SONG Guangwen1, LI Qiuping1, LIU Kai1

  1. (1. School of Geography Science and Planning, Center of Integrated Geographic Information Analysis, Sun Yat-sen University, Guangzhou 510275, China; 2. Center of Geographic Information Analysis for Public Security, School of Geographic Sciences, Guangzhou University, Guangzhou 510006, China; 3. Department of Geography, University of Cincinnati, Cincinnati OH45221-0131, USA)
  • Online:2018-03-05 Published:2018-03-05

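Abstract (Chinese version): Online geocoding APIs are used to geocode massive collections of Chinese addresses quickly. On this basis, simple rules are applied to clean and flag the geocoding results, and a classification-based optimization model built on hierarchical clustering and random forest then classifies and optimizes the results from multiple platforms. The model was trained and validated on the addresses of theft cases in Guangzhou. The results show that, compared with the unprocessed geocoding results, the model-optimized results have a smaller overall positional error distance. Amap's geocoding service has the best geocoding quality, yet the mean Amap error distance for the training sample is still as high as 590.43 m; after optimization by the model, the mean error distance for that sample falls to 173.73 m. For the validation sample, the mean error distance falls from 554.88 m (Amap) to 180.04 m, a reduction of 67.49%, and 90.08% of Amap's abnormal geocoding results are cleaned and optimized. The optimization effect is similar for the training and validation samples, for cases with different address types, and for cases located in urban and suburban areas, which suggests that the model has a degree of general applicability. The model can conveniently and quickly convert massive socio-economic information into spatial data and improve geocoding accuracy, providing better data support for research on geographic big data.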
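The error distances quoted in the abstract can be read as the distance between a geocoded point and the known true location of the same address. The abstract does not say which distance formula the authors used, so the following is only a minimal sketch that assumes a great-circle (haversine) distance on a spherical Earth; the function names `error_distance_m` and `mean_error_distance_m` are hypothetical.

```python
import math

# Assumption: spherical Earth with the mean radius in metres.
EARTH_RADIUS_M = 6_371_000.0


def error_distance_m(geocoded, truth):
    """Haversine distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, geocoded)
    lat2, lon2 = map(math.radians, truth)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def mean_error_distance_m(pairs):
    """Mean error distance over (geocoded, truth) coordinate pairs."""
    distances = [error_distance_m(geocoded, truth) for geocoded, truth in pairs]
    if not distances:
        raise ValueError("at least one (geocoded, truth) pair is required")
    return sum(distances) / len(distances)
```

Both points must be expressed in the same coordinate system before they are compared; the sketch does no conversion.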

Keywords (Chinese version): online geocoding, data cleaning, hierarchical clustering, random forest

Abstract: Online geocoding services can convert text-based addresses into positions on the map conveniently and efficiently. However, all online geocoding services contain inaccuracies, and the size of the error differs among service providers. It is therefore necessary to filter and optimize geocoded results to improve accuracy. In this paper, we develop a multi-source integration model, based on a clustering algorithm, for optimizing online geocoding results for Chinese addresses. The model assesses inconsistencies among online geocoding providers and produces an optimal result. It geocodes massive collections of Chinese addresses quickly through the application programming interfaces (APIs) of online geocoding services, including Amap, Baidu and Tencent. First, data-cleaning rules are applied to determine whether each online geocoding result is credible. Then, the credible results are further improved by an optimization algorithm that combines clustering with a random forest. A training sample of 2000 theft addresses with precisely known locations is grouped by hierarchical clustering. The addresses are divided into 8 groups, which are then used to train the random forest model, yielding an accuracy of 95.36%. The trained model is then validated on a second sample, also containing 2000 theft addresses. Our experiments show the following: 1) for addresses with a mediocre level of standardization, the Amap geocoding service has the highest quality but still has significant spatial inaccuracy; 2) the spatial confidence values and geographic levels returned by the geocoding APIs reflect the quality of geocoding; 3) the locational accuracy of the model is significantly higher than that of each of the three providers. For the training sample, the mean error distance of Amap is as high as 590.43 m, and the model reduces it to 173.73 m, with 87.78% of the addresses geocoded. For the validation sample, the model reduces the mean error distance from 554.88 m to 180.04 m, with 90.08% of the abnormal geocoding results from Amap rejected. The accuracy and geocoding success rates of the two samples are similar; and 4) the model optimizes geocoding results for addresses in both urban and suburban areas with comparable accuracy, which suggests that it is widely applicable. In sum, the model converts massive collections of text-based Chinese addresses into spatial locations effectively and efficiently, and improves online geocoding accuracy through clustering and optimization.
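The first two steps described in the abstract, checking whether each provider's result is credible and then assessing how far the providers disagree, can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' rules or model: the `GeocodeResult` fields, the `MIN_CONFIDENCE` threshold, and the `ACCEPTED_LEVELS` labels are all hypothetical, and the sketch assumes every coordinate has already been converted to one common coordinate system. The hierarchical clustering and random forest stages are not reproduced here.

```python
import math
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth, mean radius

# Hypothetical cleaning thresholds, for illustration only.
MIN_CONFIDENCE = 50.0
ACCEPTED_LEVELS = {"poi", "door_number", "road"}


@dataclass
class GeocodeResult:
    provider: str      # e.g. "amap", "baidu", "tencent"
    lat: float         # degrees, common coordinate system assumed
    lon: float
    confidence: float  # hypothetical 0-100 confidence score
    level: str         # hypothetical geographic-level label


def is_credible(result: GeocodeResult) -> bool:
    """Simple rule: keep results that are confident and fine-grained enough."""
    return result.confidence >= MIN_CONFIDENCE and result.level in ACCEPTED_LEVELS


def distance_m(a: GeocodeResult, b: GeocodeResult) -> float:
    """Haversine distance in metres between two geocoded points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def max_disagreement_m(results: List[GeocodeResult]) -> Optional[float]:
    """Largest pairwise distance among credible results, or None if fewer than two remain."""
    credible = [r for r in results if is_credible(r)]
    if len(credible) < 2:
        return None
    return max(distance_m(a, b) for a, b in combinations(credible, 2))
```

A large disagreement among credible results flags an address whose providers are inconsistent; features of this kind could plausibly feed a later classification step, though the abstract does not specify which features the model uses.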

Key words: online geocoding service, data cleaning, hierarchical clustering method, random forest