Handling JSON data with pandas


There are three main ways to parse a JSON string into a DataFrame:

  1. Use pandas' built-in read_json to parse the string directly
  2. Use json.loads together with pandas' json_normalize
  3. Use json.loads and construct the DataFrame yourself (this requires manually reshaping the dict returned by loads)


Because read_json parses the string directly, it is the fastest of the three, but it is also the strictest about format: the JSON string must match one of the layouts it supports, which are documented on the pandas website. json_normalize, by contrast, works on the dict produced by json.loads, so it is far more flexible than read_json. Surprisingly, though, it turned out to be slower than parsing the dict myself (a list comprehension is much faster than an ordinary for loop here). Manual parsing is also the most flexible option, since you can do some simple data cleaning before constructing the DataFrame.

# -*- coding: UTF-8 -*-
from pandas.io.json import json_normalize
import pandas as pd
import json
import time

# Read in the data
data_str = open('data.json').read()
print(data_str)

# Benchmark json_normalize
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    df = json_normalize(data_list)
end_time = time.time()
print(end_time - start_time)

# Benchmark manual construction
start_time = time.time()
for i in range(0, 300):
    data_list = json.loads(data_str)
    data = [[d['timestamp'], d['value']] for d in data_list]
    df = pd.DataFrame(data, columns=['timestamp', 'value'])
end_time = time.time()
print(end_time - start_time)

# Benchmark read_json
start_time = time.time()
for i in range(0, 300):
    df = pd.read_json(data_str, orient='records')
end_time = time.time()
print(end_time - start_time)

 








The read_json function in pandas converts JSON data into a DataFrame. The signature of pandas.read_json is as follows:

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')

 

The first parameter is the path to a JSON file, or a JSON-formatted string.

The second parameter, orient, declares the expected layout of the JSON string. orient accepts the following values:

(1) 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

This layout is built from an index, column labels, and a data matrix. The keys must be named index, columns, and data.

import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
print(pd.read_json(s, orient='split'))

 

Output:

   a  b
1  1  3
2  2  8
3  3  9
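For reference, DataFrame.to_json can emit this same layout, so a quick round trip (a minimal sketch) looks like this:

import pandas as pd

s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
# Serializing back with the same orient reproduces the index/columns/data structure
print(df.to_json(orient='split'))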

 


(2) 'records' : list like [{column -> value}, ... , {column -> value}]

This is a list whose members are dicts: the column labels are the keys, the cell values are the dict values, and each dict becomes one row of the DataFrame.

import pandas as pd

s = '[{"name":"xiaomaimiao","age":20},{"name":"xxt","age":18},{"name":"xmm","age":1}]'
print(pd.read_json(s, orient='records'))

 

Output:

   age         name
0   20  xiaomaimiao
1   18          xxt
2    1          xmm

 

Another example:

# coding=utf-8
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

s = open('a.json', encoding='UTF-8').read()
df = pd.read_json(s, orient='records')
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag", "University"])
df.to_excel('pandas处理json1.xlsx', index=False)

 

Output: (not shown here)

Data file: a.json.zip
or: https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json



(3) 'index' : dict like {index -> {column -> value}}

The index values are the keys, and each value is a dict keyed by column label. For example:

import pandas as pd

s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
print(pd.read_json(s, orient='index'))

 

Output:

   a   b
0  1   2
1  9  11

 


(4) 'columns' : dict like {column -> {index -> value}}

Here each column label is a key whose value is itself a dict of cell values keyed by index. For example:

import pandas as pd

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
print(pd.read_json(s, orient='columns'))

 

Output:

   a   b
0  1   2
1  9  11
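According to the pandas documentation, 'columns' is the default orient when typ='frame', so the same string should parse identically without the argument (a quick check):

import pandas as pd

s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
# 'columns' is the default orient for typ='frame'
print(pd.read_json(s))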

 


(5) 'values' : just the values array

The values form is the most familiar one: a two-level nested list whose members are themselves lists, one per row.

import pandas as pd

s = '[["a",1],["b",2]]'
print(pd.read_json(s, orient='values'))

 

Output:

   0  1
0  a  1
1  b  2
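The read_json signature shown earlier also accepts lines=True for newline-delimited JSON, i.e. one 'records'-style object per line; a minimal sketch:

import pandas as pd

# Newline-delimited JSON: one object per line
s = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
print(pd.read_json(s, orient='records', lines=True))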

 





The JSON string to process:

strtext='[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

 

Code:

import pandas as pd

strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'

df = pd.read_json(strtext, orient='records')
df.to_excel('pandas处理json.xlsx', index=False, columns=["ttery", "issue", "code", "code1", "code2", "time"])

 

Output: the DataFrame is written to the Excel file (screenshot not shown).

 


Another example:

# coding=utf-8
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

s = '[{"Location":{"L1":"北京","L2":"北京市","L3":""},"University":null,"StudentFlag":0,"_id":"5cd682d61c0acc002c7a9282","Teams":[{"TeamOid":"5cd685cbe32417002baf7fb8"}],"Name":"soyotec","Avatar":"//cdn.kesci.com/images/avatar/4.jpg","Job":"总经理","Company":"北京树优信息技术有限公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5cf094ffa0e904002bae85a1","Teams":[{"TeamOid":"5cf08ec12e1a1d002b3912ed"}],"Name":"ne12212","Avatar":"//cdn.kesci.com/images/avatar/1.jpg","Job":"产品经理","Company":"中国电信股份有限公司上海分公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Signature":"L4级无人驾驶卡车"}]'

print(pd.read_json(s, orient='records'))

 


Output:

                                              Avatar           Company      Job                             Location  MajorTag     Name  Signature  StudentFlag                                      Teams  University                       _id
0                //cdn.kesci.com/images/avatar/4.jpg      北京树优信息技术有限公司      总经理  {'L1': '北京', 'L2': '北京市', 'L3': ''}       NaN  soyotec        NaN            0  [{'TeamOid': '5cd685cbe32417002baf7fb8'}]         NaN  5cd682d61c0acc002c7a9282
1                //cdn.kesci.com/images/avatar/1.jpg   中国电信股份有限公司上海分公司     产品经理  {'L1': '上海', 'L2': '上海市', 'L3': ''}       NaN  ne12212        NaN            0  [{'TeamOid': '5cf08ec12e1a1d002b3912ed'}]         NaN  5cf094ffa0e904002bae85a1
2  https://cdn.kesci.com/upload/image/pruh63s30o.jpg  上海图森未来人工智能科技有限公司  AIWIN大赛  {'L1': '上海', 'L2': '上海市', 'L3': ''}       NaN     图森未来  L4级无人驾驶卡车            0  [{'TeamOid': '5ce3af2f0e87f8002caad07e'}]         NaN  5ce37539e60640002b771cfa

 






Example:

from pandas.io.json import json_normalize

data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
        {'name': {'given': 'Mose', 'family': 'Regner'}},
        {'id': 2, 'name': 'Faye Raker'}]
print(json_normalize(data))

 

Output:

    id        name name.family name.first name.given name.last
0  1.0         NaN         NaN     Coleen        NaN      Volk
1  NaN         NaN      Regner        NaN       Mose       NaN
2  2.0  Faye Raker         NaN        NaN        NaN       NaN

 

Example:

from pandas.io.json import json_normalize

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
print(result)

 

Output:

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

 

Example:

from pandas.io.json import json_normalize

data = {'A': [1, 2]}
result = json_normalize(data, 'A', record_prefix='Prefix.')
print(result)

 

Output:

   Prefix.0
0         1
1         2

 


Example:

#coding=utf-8
from pandas.io.json import json_normalize
import json

# Read in the data
data_str = open('a.json', encoding='utf-8').read()
data_list = json.loads(data_str)
df = json_normalize(data_list)
df.to_excel('1.xlsx', index=False)
print(df)

 

Output: (not shown here)

Data file: a.json.zip
or: https://raw.githubusercontent.com/lhrbest/Python/master/xxt_test_json.json


Example:

# coding=utf-8
import json
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

data_str = open('a.json', encoding='UTF-8').read()
data_list = json.loads(data_str)
data = [[d["Location"], d["Company"]] for d in data_list]
# The column labels must follow the same order as the values in each row
df = pd.DataFrame(data, columns=["Location", "Company"])
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
# df.to_excel('pandas处理json1.xlsx', index=False)

 

Output:

                              Location           Company
0  {'L1': '北京', 'L2': '北京市', 'L3': ''}      北京树优信息技术有限公司
1  {'L1': '上海', 'L2': '上海市', 'L3': ''}   中国电信股份有限公司上海分公司
2  {'L1': '上海', 'L2': '上海市', 'L3': ''}  上海图森未来人工智能科技有限公司
3  {'L1': '上海', 'L2': '上海市', 'L3': ''}              None
4  {'L1': '上海', 'L2': '上海市', 'L3': ''}                AI
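For comparison, json_normalize would flatten the nested Location dict into separate columns automatically; a minimal sketch, assuming the same a.json file:

import json
from pandas.io.json import json_normalize

data_list = json.loads(open('a.json', encoding='UTF-8').read())
df = json_normalize(data_list)          # nested keys become Location.L1, Location.L2, Location.L3, ...
print(df[["Location.L1", "Location.L2", "Company"]].head(5))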

 


Example:

import json
import pandas as pd

data_str = '''[{"department": "xxt", "query_result": {"code": "10", "description": "1000"},  "is_invoice": 1,  "imageName": "./imgs/8888888.jpeg",  "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}},  {"department": "xxt2", "query_result": {},  "is_invoice": 0,  "imageName": "./imgs/51111111.jpeg", "reco_result": {}}]'''

# Read from a file
# data_str = open('a.json',encoding='utf-8').read()
# data_list = json.loads(data_str)

# Load directly from the string
data_list = json.loads(data_str)

data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]
data2_all = [d["query_result"] for d in data_list]
data5_all = [d["reco_result"] for d in data_list]

# for i in data1_all:
#     print("data1_0:", i[0])
# print("data5", data5_all)
# print("data1", data1_all)

def get_data(data1_all, data2_all, data5_all):
    col_value = []
    for data1, data2, data5 in zip(data1_all, data2_all, data5_all):
        department = data1[0]
        is_invoice = data1[1]
        imageName = data1[2]
        if 'code' in data2:
            code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
        else:
            code = ""
            description = ""
        if 'total' in data5:
            total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
            check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
            invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
        else:
            total = ""
            invoice_no = ""
            create_date = ""
            check_code = ""
            invoice_code = ""
        col_value.append((department, is_invoice, imageName, code, description, total, invoice_no, create_date,
                          check_code, invoice_code))
    return col_value

col_value = get_data(data1_all, data2_all, data5_all)
df = pd.DataFrame(col_value, index=None,
                  columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no",
                           "create_date", "check_code", "invoice_code"])
df.to_excel('excel_pd.xls')
print(df)

 


Output:

  department  is_invoice     ...      check_code invoice_code
0        xxt           1     ...                             
1       xxt2           0     ...                             

[2 rows x 10 columns]
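The str()/split() parsing above happens to work for this exact layout, but it is fragile (a comma inside any value would break it). A less brittle sketch, not the author's original code, indexes the nested dicts directly:

def get_data_clean(data_list):
    """Extract the same ten columns by reading the nested dicts with .get()."""
    rows = []
    for d in data_list:
        q = d.get("query_result", {}) or {}
        r = d.get("reco_result", {}) or {}
        rows.append((d["department"], d["is_invoice"], d["imageName"],
                     q.get("code", ""), q.get("description", ""),
                     r.get("total", ""), r.get("invoice_no", ""), r.get("create_date", ""),
                     r.get("check_code", ""), r.get("invoice_code", "")))
    return rows

col_value = get_data_clean(data_list) can then feed the same pd.DataFrame(...) call as above.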

 





----- Processing the data yourself

import json
import pandas as pd

# Load directly from a string
# data_str='''{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Avatar":"L4级无人驾驶卡车"}'''
# data_list = json.loads(data_str)

# Read from a file
data_str = open('a.json', encoding='utf-8').read()
data_list = json.loads(data_str)

data_s_all = [[d["StudentFlag"], d["Name"], d["Job"], d["Company"]] for d in data_list]
data_Location_all = [d["Location"] for d in data_list]
data_University_all = [d["University"] for d in data_list]
data_MajorTag_all = [d["MajorTag"] for d in data_list]

def get_data(data_s_all, data_Location_all, data_University_all, data_MajorTag_all):
    col_value = []
    for data_s, data_Location, data_University, data_MajorTag in zip(data_s_all, data_Location_all, data_University_all, data_MajorTag_all):
        StudentFlag = data_s[0]
        Name = data_s[1]
        Job = data_s[2]
        Company = data_s[3]
        Location_result = ''
        University_result = ''
        Major = ''
        # Location
        try:
            if 'L1' in data_Location:
                L1 = str(data_Location).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
                L2 = str(data_Location).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
                L3 = str(data_Location).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
                Location_result = L1 + L2 + L3
            else:
                L1 = ""
                L2 = ""
                L3 = ""
        except Exception as e:
            pass
        # Country
        try:
            if 'Country' in data_University:
                Country = str(data_University).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
                Province = str(data_University).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
                University = str(data_University).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
                University_result = Country + Province + University
            else:
                Country = ""
                Province = ""
                University = ""
        except Exception as e:
            pass
        # Major
        try:
            if 'Major' in data_MajorTag:
                Major = str(data_MajorTag).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            else:
                Major = ""
        except Exception as e:
            pass
        col_value.append((Name, StudentFlag, Job, Company, Location_result, University_result, Major))
    return col_value

col_value = get_data(data_s_all, data_Location_all, data_University_all, data_MajorTag_all)
df = pd.DataFrame(col_value, index=None,
                  columns=["Name", "StudentFlag", "Job", "Company", "Location_result", "University_result", "Major"])
df.to_excel('excel_pd2.xls')
print(df.head(3))

 


Output:

      Name  StudentFlag      Job  ...  Location_result University_result Major
0  soyotec            0      总经理  ...          北京 北京市                         
1  ne12212            0     产品经理  ...          上海 上海市                         
2     图森未来            0  AIWIN大赛  ...          上海 上海市

 




Tip: clean a complex JSON string into the format below first, then load it with data_list = json.loads(data_str); a small loading sketch follows the sample.

{"error_code":40007,"error_msg":"fail to recognize"}

[{"department": "abcdef", "query_result": {"code": "1000", "description": "1000"},
  "is_invoice": 1,
  "imageName": "./imgs/8888888.jpeg",
  "reco_result": {"total": "", "invoice_no": "123", "create_date": "", "check_code": ""}}]
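A minimal sketch of that loading step, assuming the cleaned text is a list of records like the second snippet above (the filename cleaned.json is hypothetical):

import json
from pandas.io.json import json_normalize

data_str = open('cleaned.json', encoding='utf-8').read()   # hypothetical file holding the cleaned list
data_list = json.loads(data_str)                           # a list of dicts
df = json_normalize(data_list)                             # nested dicts become dotted column names
print(df.columns.tolist())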

 

Batch-reading JSON files (Chinese-language JSON)

The two JSON files under ./out_file contain the following:

out_01.txt contains: "{"name_ID":"12343","name":"张三","身份编码":"未知"}"
out_02.txt contains: "{"name_ID":"12344","name":"李四","身份编码":"98983"}"

 

import json
import os

def img_w_h(text_path):
    data_str_list = []
    img_name_list = []
    for filename in os.listdir(text_path):
        file_path = text_path + '/' + filename
        print("获取文件:", file_path)
        data_str = open(file_path, "r", encoding='UTF-8').read()
        data_str_list.append(data_str)
        img_name_list.append(filename)
    print("data_str_list", data_str_list)
    return data_str_list, img_name_list

def json_to_excel(data_str_list):
    data_all = []
    for data_str in data_str_list:
        if data_str.startswith(u'\ufeff'):
            content = data_str.encode('utf8')[3:].decode('utf8')   # strip the UTF-8 BOM
            text = json.loads(content[1:-1])                       # drop the surrounding quotes before parsing
            if text["身份编码"] == "未知":
                data_all.append(text["身份编码"])
    return data_all

if __name__ == "__main__":
    text_path = "./out_file"
    data_str_list, img_name_list = img_w_h(text_path)
    data_all = json_to_excel(data_str_list)
    print("data_all:", data_all)

Output:

获取文件: ./out_file/out_01.txt
获取文件: ./out_file/out_02.txt
data_str_list ['\ufeff"{"name_ID":"12343","name":"张三","身份编码":"98983"}"', '\ufeff"{"name_ID":"12343","name":"张三","身份编码":"未知"}"']
data_all: ['未知']

 

Parsing a complex JSON format and saving to Excel

import json
import pandas as pd

"""
Data format 1 (pretty-printed here for readability):
[{"department": "abcdef", "query_result": {"code": "1000", "description": "1000"},
  "is_invoice": 1,
  "imageName": "./imgs/8888888.jpeg",
  "reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}},
 {"department": "abcdef", "query_result": {},
  "is_invoice": 0,
  "imageName": "./imgs/51111111.jpeg",
  "reco_result": {}},
 ...]
"""

data_str = open('json_img.json').read()
data_list = json.loads(data_str)

data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]
data2_all = [d["query_result"] for d in data_list]
data5_all = [d["reco_result"] for d in data_list]

for i in data1_all:
    print("data1_0:", i[0])
print("data5", data5_all)
print("data1", data1_all)

def get_data(data1_all, data2_all, data5_all):
    col_value = []
    for data1, data2, data5 in zip(data1_all, data2_all, data5_all):
        department = data1[0]
        is_invoice = data1[1]
        imageName = data1[2]
        if 'code' in data2:
            code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
        else:
            code = "NAN"
            description = "NAN"
        if 'total' in data5:
            total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
            invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
            create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
            check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
            invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
        else:
            total = "NAN"
            invoice_no = "NAN"
            create_date = "NAN"
            check_code = "NAN"
            invoice_code = "NAN"
        col_value.append((department, is_invoice, imageName, code, description, total, invoice_no, create_date, check_code, invoice_code))
    return col_value

col_value = get_data(data1_all, data2_all, data5_all)
df = pd.DataFrame(col_value, index=None,
                  columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no", "create_date", "check_code", "invoice_code"])
df.to_excel('excel_pd.xls')

 

Reading JSON data from an Excel column (each cell uses Data format 1 above)

import json
import pandas as pd
import xlrd

excel_path = "C:\\Users\\Desktop\\test_data.xlsx"

def read_excel(excel_path):
    workbook = xlrd.open_workbook(excel_path)
    sheet = workbook.sheet_by_name("Sheet1")
    nrows = sheet.nrows
    list1 = []
    for i in range(1, nrows):
        list1.append(sheet.row_values(i)[0])
    return list1

def get_data(excel_path):
    list1 = read_excel(excel_path)
    All_data = []
    for i in range(len(list1)):          # iterate over the list (i.e. over every cell in the column)
        data_list = json.loads(list1[i])
        # print("data_list:", type(data_list))
        for i in range(len(data_list)):  # iterate over every JSON object in the cell's list
            # print(type(data_list[i]))
            data_dict = data_list[i]
            try:
                imageNo = data_dict["imageNo"]
                businessType = data_dict["businessType"]
                reco_result = data_dict["reco_result"]
                try:
                    total = reco_result["total"]
                    invoice_no = reco_result["invoice_no"]
                    create_date = reco_result["create_date"]
                    check_code = reco_result["check_code"]
                    invoice_code = reco_result["invoice_code"]
                except:
                    total = "NAN"
                    invoice_no = "NAN"
                    create_date = "NAN"
                    check_code = "NAN"
                    invoice_code = "NAN"
                is_invoice = data_dict["is_invoice"]
                billId = data_dict["billId"]
                imageName = data_dict["imageName"]
                applyNum = data_dict["applyNum"]
                department = data_dict["department"]
                query_result = data_dict["query_result"]
                try:
                    code = query_result["code"]
                    description = query_result["description"]
                except:
                    code = "NAN"
                    description = "NAN"
                All_data.append((imageNo, businessType, total, invoice_no, create_date, check_code,
                                 invoice_code, is_invoice, billId, imageName,
                                 applyNum, department, code, description))
            except:
                print("数据格式出错!")
                pass
    return All_data

All_data = get_data(excel_path)
df = pd.DataFrame(All_data, index=None,
                  columns=["imageNo", "businessType", "total", "invoice_no", "create_date", "check_code",
                           "invoice_code", "is_invoice", "billId", "imageName",
                           "applyNum", "department", "code", "description"])
df.to_excel('C:\\Users\\Desktop/001.xls')
print("done!")

 

Formatting a special JSON file

Raw data:

{"41196516":"{\"type\":\"身份证正面\",\"name\":\"徐XX\",\"sex\":\"男\",\"people\":\"汉\",...,"41196243":"{\"error_code\"...

Processed into the following JSON file (highly non-standard):

{"41196516":"{"type":"身份证正面","name":"徐XX","sex":"男","people":"汉","birthday":"19XX年7月XX日","address":"广州市花都区*****号","id_number":"4401***15","issue_authority":"广州市XXX局","validity":"20XX.XX.13-20XX.XX.13","time_cost":{"recognize":348,"preprocess":28},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":true,"error_code":0,"error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证","name":"魏XX","sex":"男","people":"汉","birthday":"19XX年9月XX日","address":"江苏省江阴市XXX号","id_number":"320XXX17","time_cost":{"recognize":398,"preprocess":29},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":false,"error_code":0,"error_msg":"OK"}","41197139":"{"type":"身份证背面","issue_authority":"佛山市XXX分局","validity":"2005.XX.XX-2025.XX.XX","time_cost":{"recognize":464,"preprocess":48},"complete":true,"error_code":0,"error_msg":"OK"}"}

Pretty-printed for readability:

{"41196516":"{"type":"身份证正面",
  "name":"徐XX",
  "sex":"男",
  "people":"汉",
  "birthday":"19XX年7月XX日",
  "address":"广州市花都区*****号",
  "id_number":"4401***15",
  "issue_authority":"广州市XXX局",
  "validity":"20XX.XX.13-20XX.XX.13",
  "time_cost":{"recognize":348,"preprocess":28},
  "complete":true,
  "border_covered":false,
  "head_covered":false,
  "head_blurred":false,
  "gray_image":true,
  "error_code":0,
  "error_msg":"OK"}",
"41196243":"{"error_code":40007,"error_msg":"fail to recognize"}",
"41196510":"{"type":"二代身份证",
 "name":"魏XX",
 "sex":"男",
 "people":"汉",
 "birthday":"19XX年9月XX日",
 "address":"江苏省江阴市XXX号",
 "id_number":"320XXX17",
 "time_cost":{"recognize":398,"preprocess":29},
 "complete":true,
 "border_covered":false,
 "head_covered":false,
 "head_blurred":false,
 "gray_image":false,
 "error_code":0,
 "error_msg":"OK"}",
"41197139":"{"type":"身份证背面",
 "issue_authority":"佛山市XXX分局","validity":"2005.XX.XX-2025.XX.XX","time_cost":{"recognize":464,"preprocess":48},"complete":true,"error_code":0,"error_msg":"OK"}"}

 

The parsing code is as follows:

import json
import pandas as pd

data_str = open('D:/XXX/XXX文档/reize_result20181227.txt', "r", encoding="utf-8").read()
data_str0 = data_str.replace("\\", "")
print(data_str0)
imgName_list = []

def get_data(data_str0):
    All_data = []
    num = data_str0.count("error_code")                        # count how many JSON objects there are ("error_code" appears in every one)
    for i in range(num):
        imgName = data_str0[1:-1].split("\":\"{")[i][-8:]      # get the image ID ([1:-1] strips the outermost braces)
        print("imgName", imgName)
        img_str1 = "{" + data_str0[1:-1].split("\":\"{")[i+1].split("}\",\"")[0] + "}"          # grab the whole JSON object
        img_str1 = img_str1.replace("\"}\"}}", "\"}") if "\"}\"}" in img_str1 else img_str1     # strip the extra trailing characters
        print("img_str1:", img_str1)
        data_list = json.loads(img_str1)
        #########################################################################
        try:
            type_ = data_list["type"]
        except:
            type_ = "NAN"
        try:
            name = data_list["name"]
        except:
            name = "NAN"
        try:
            sex = data_list["sex"]
        except:
            sex = "NAN"
        try:
            people = data_list["people"]
        except:
            people = "NAN"
        try:
            birthday = data_list["birthday"]
        except:
            birthday = "NAN"
        try:
            address = data_list["address"]
        except:
            address = "NAN"
        try:
            id_number = data_list["id_number"]
        except:
            id_number = "NAN"
        try:
            issue_authority = data_list["issue_authority"]
        except:
            issue_authority = "NAN"
        try:
            validity = data_list["validity"]
        except:
            validity = "NAN"
        try:
            time_cost = data_list["time_cost"]
            recognize = time_cost["recognize"]
            preprocess = time_cost["preprocess"]
        except:
            time_cost = "NAN"
            recognize = "NAN"
            preprocess = "NAN"
        try:
            complete = data_list["complete"]
        except:
            complete = "NAN"
        try:
            border_covered = data_list["border_covered"]
        except:
            border_covered = "NAN"
        try:
            head_covered = data_list["head_covered"]
        except:
            head_covered = "NAN"
        try:
            head_blurred = data_list["head_blurred"]
        except:
            head_blurred = "NAN"
        try:
            gray_image = data_list["gray_image"]
        except:
            gray_image = "NAN"
        error_code = data_list["error_code"]
        error_msg = data_list["error_msg"]
        All_data.append((imgName, type_, name, sex, people, birthday,
                         address, id_number, issue_authority, validity,
                         recognize, preprocess, complete, border_covered,
                         head_covered, head_blurred, gray_image, error_code, error_msg))
    return All_data

All_data = get_data(data_str0)
df = pd.DataFrame(All_data, index=None,
                  columns=["imgName", "type_", "name", "sex", "people", "birthday",
                           "address", "id_number", "issue_authority", "validity",
                           "recognize", "preprocess", "complete", "border_covered",
                           "head_covered", "head_blurred", "gray_image", "error_code", "error_msg"])
df.to_excel('D:/XXX/XXX文档/reize_result20181227.xls')

 

Parsing complex JSON

  • Error 1: json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
    Cause: the string uses single quotes as its string delimiters, which is not valid JSON.
    Fix: replace the single quotes in the string with double quotes (see the sketch after this list).
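A minimal sketch of that fix, assuming the single quotes are only used as delimiters (a blanket replace would mangle values that themselves contain quotes):

import json

bad_str = "[{'name': 'xxt', 'age': 18}]"       # single-quoted, so json.loads would raise the error above
data_list = json.loads(bad_str.replace("'", '"'))
print(data_list[0]["name"])                     # -> xxt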

Writing a JSON file

import json
import os

def get_img(file_path):
    img_path = []
    for path, dirname, filenames in os.walk(file_path):
        for filename in filenames:
            img_path.append(path + "/" + filename)
    return img_path

def json_str(file_path):
    dict_str = []
    img_path = get_img(file_path)
    for i in img_path:
        dict_str.append({"ImageName": "/image/bus/" + i, "id": "8abs63twy2001"})
    return dict_str

file_path = "./image/ocr"
dict_str = json_str(file_path)
json_str = json.dumps(dict_str)
with open("./dict_str_to_json.json", "w") as json_file:
    json_file.write(json_str)
print("done!")

    

