暂无图片
暂无图片
暂无图片
暂无图片
暂无图片

从Oracle迁移到GreenPlum(二)数据库字符集转换问题

原创 姚昕 2021-07-05
1348

在进行从Oracle数据库迁移到Greenplum数据库的过程中遇到一个比较棘手的问题

我们查询Oracle数据库的字符集为ZHS16GBK

SQL> select * from nls_database_parameters; PARAMETER VALUE ------------------------------ -------------------------------------------------------------------------------- NLS_LANGUAGE AMERICAN NLS_TERRITORY AMERICA NLS_CURRENCY $ NLS_ISO_CURRENCY AMERICA NLS_NUMERIC_CHARACTERS ., NLS_CHARACTERSET ZHS16GBK NLS_CALENDAR GREGORIAN NLS_DATE_FORMAT DD-MON-RR NLS_DATE_LANGUAGE AMERICAN NLS_SORT BINARY NLS_TIME_FORMAT HH.MI.SSXFF AM PARAMETER VALUE ------------------------------ -------------------------------------------------------------------------------- NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR NLS_DUAL_CURRENCY $ NLS_COMP BINARY NLS_LENGTH_SEMANTICS BYTE NLS_NCHAR_CONV_EXCP FALSE NLS_NCHAR_CHARACTERSET UTF8 NLS_RDBMS_VERSION 11.2.0.4.0

而Greenplum数据库的只支持UTF-8字符集,并不支持GBK字符集,这也就意味着,从Orale迁移数据到Greenplum的时候,需要进行数据的字符转换

这就涉及到一个问题:

在GBK字符集中,中文字符占用2个字节,而在UTF-8字符集中,中文字符占用3个字节

所以如果在GBK字符集的Oracle数据库中,name字段设定varchar(10),能够存储5个汉字,那么如果迁移到Greenplum中,同样的5个汉字,则需要15个字节,才能够进行存储。

所以如果name varchar(10) 迁移到Greenplum数据的话,为了保障数据不会超限,那么name字段就要扩容1.5倍 name varchar(15)

这个时候,就产生一个需求,如果能够找出数据库中,所有包含中文的字段呢?
我们可以使用两个函数

select length('中国') from dual; --------------- 2 select lengthb('中国') from dual; --------------- 4 从中可以看出 LENGTH()函数用于获取对应字段的字符长度 LENGTHB()函数用于获取对应字段的字节长度

所以,我们可以使用length(column_name) <> lengthb(column_name)这样的逻辑,来判断字段中是否包含中文
写一个简单的小pyhton程序,来进行扫描

import cx_Oracle from multiprocessing.pool import ThreadPool import queue cx_Oracle.init_oracle_client(lib_dir="/Users/kiwi/Downloads/oraclient") class Worker(object): def __init__(self, conn_list,boss_q): self.boss_q = boss_q self.user,self.passwd,self.listener = conn_list def do(self, job): sql2_tmp = """update scott.chinese_table set cn_flag=1,modify_date=current_date where owner='{owner}' AND table_name='{tablename}' and column_name='{columnname}' and EXISTS (select * from {owner}.{tablename} where LENGTH({columnname}) <> lengthb({columnname}))""" schema,tableName,columnName=job conn = cx_Oracle.connect(self.user, self.passwd, self.listener) cursor = conn.cursor() sql2 = sql2_tmp.format(owner=schema,tablename=tableName,columnname=columnName) cursor.execute(sql2) conn.commit() cursor.close() conn.close() def run(self): while True: print(self.boss_q.qsize()) if self.boss_q.qsize() == 0: print("ALL JOB IS DONE! \n") return try: job = self.boss_q.get(timeout=6) self.do(job) except Exception as e: print(f"find error: {e}") def exec_ora_dml(*args): user, passwd, listener, sql = args conn = cx_Oracle.connect(user, passwd, listener) cursor = conn.cursor() cursor.execute(sql) conn.commit() cursor.close() conn.close() def exec_ora_select(conn_list,sql): user, passwd, listener = conn_list conn = cx_Oracle.connect(user, passwd, listener) cursor = conn.cursor() cursor.execute(sql) rows = cursor.fetchall() cursor.close() conn.close() return rows if __name__ == '__main__': user = 'xxx' passwd = 'xxxx' listener = 'xxx.xxx.xxx.xxx:1521/kiwi' conn_list = user, passwd, listener max_workers = 20 parallel = 10 workers = [] sql1 = """ select a.owner,a.table_name,a.column_name from dba_tab_columns a,( select owner,object_name from dba_objects where owner in ( select name from sys.user$ where type#=1 and spare4 is not null and datats# >3 and password is not null and to_char(ctime,'yyyy-mm-dd') <>(select to_char(min(ctime),'yyyy-mm-dd') from sys.user$) and name not in ('HR','OE','IX','SH','PM','BI') ) and object_type='TABLE' MINUS select owner,table_name from dba_external_tables where owner in ( select name from sys.user$ where type#=1 and spare4 is not null and datats# >3 and password is not null and to_char(ctime,'yyyy-mm-dd') <>(select to_char(min(ctime),'yyyy-mm-dd') from sys.user$) and name not in ('HR','OE','IX','SH','PM','BI') ) ) d where a.owner=d.owner and a.table_name=d.object_name and a.data_type in ('CHAR','VARCHAR2','NVARCHAR2') and mod(a.data_length,2)=0 """ rows = exec_ora_select(conn_list, sql1) boss_q = queue.Queue() if len(rows) != 0: for row in rows: boss_q.put(row) pool = ThreadPool(max_workers) for i in range(parallel): worker = Worker(conn_list,boss_q,) workers.append((worker,)) pool.starmap(lambda worker: worker.run(), workers) pool.close() pool.join()
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。

评论