在进行从Oracle数据库迁移到Greenplum数据库的过程中遇到一个比较棘手的问题
我们查询Oracle数据库的字符集为ZHS16GBK
SQL> select * from nls_database_parameters;
PARAMETER VALUE
------------------------------ --------------------------------------------------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET ZHS16GBK
NLS_CALENDAR GREGORIAN
NLS_DATE_FORMAT DD-MON-RR
NLS_DATE_LANGUAGE AMERICAN
NLS_SORT BINARY
NLS_TIME_FORMAT HH.MI.SSXFF AM
PARAMETER VALUE
------------------------------ --------------------------------------------------------------------------------
NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY $
NLS_COMP BINARY
NLS_LENGTH_SEMANTICS BYTE
NLS_NCHAR_CONV_EXCP FALSE
NLS_NCHAR_CHARACTERSET UTF8
NLS_RDBMS_VERSION 11.2.0.4.0
而Greenplum数据库的只支持UTF-8字符集,并不支持GBK字符集,这也就意味着,从Orale迁移数据到Greenplum的时候,需要进行数据的字符转换
这就涉及到一个问题:
在GBK字符集中,中文字符占用2个字节,而在UTF-8字符集中,中文字符占用3个字节
所以如果在GBK字符集的Oracle数据库中,name字段设定varchar(10),能够存储5个汉字,那么如果迁移到Greenplum中,同样的5个汉字,则需要15个字节,才能够进行存储。
所以如果name varchar(10) 迁移到Greenplum数据的话,为了保障数据不会超限,那么name字段就要扩容1.5倍 name varchar(15)
这个时候,就产生一个需求,如果能够找出数据库中,所有包含中文的字段呢?
我们可以使用两个函数
select length('中国') from dual;
---------------
2
select lengthb('中国') from dual;
---------------
4
从中可以看出
LENGTH()函数用于获取对应字段的字符长度
LENGTHB()函数用于获取对应字段的字节长度
所以,我们可以使用length(column_name) <> lengthb(column_name)这样的逻辑,来判断字段中是否包含中文
写一个简单的小pyhton程序,来进行扫描
import cx_Oracle
from multiprocessing.pool import ThreadPool
import queue
cx_Oracle.init_oracle_client(lib_dir="/Users/kiwi/Downloads/oraclient")
class Worker(object):
def __init__(self, conn_list,boss_q):
self.boss_q = boss_q
self.user,self.passwd,self.listener = conn_list
def do(self, job):
sql2_tmp = """update scott.chinese_table set cn_flag=1,modify_date=current_date
where owner='{owner}'
AND table_name='{tablename}'
and column_name='{columnname}'
and EXISTS (select * from {owner}.{tablename} where LENGTH({columnname}) <> lengthb({columnname}))"""
schema,tableName,columnName=job
conn = cx_Oracle.connect(self.user, self.passwd, self.listener)
cursor = conn.cursor()
sql2 = sql2_tmp.format(owner=schema,tablename=tableName,columnname=columnName)
cursor.execute(sql2)
conn.commit()
cursor.close()
conn.close()
def run(self):
while True:
print(self.boss_q.qsize())
if self.boss_q.qsize() == 0:
print("ALL JOB IS DONE! \n")
return
try:
job = self.boss_q.get(timeout=6)
self.do(job)
except Exception as e:
print(f"find error: {e}")
def exec_ora_dml(*args):
user, passwd, listener, sql = args
conn = cx_Oracle.connect(user, passwd, listener)
cursor = conn.cursor()
cursor.execute(sql)
conn.commit()
cursor.close()
conn.close()
def exec_ora_select(conn_list,sql):
user, passwd, listener = conn_list
conn = cx_Oracle.connect(user, passwd, listener)
cursor = conn.cursor()
cursor.execute(sql)
rows = cursor.fetchall()
cursor.close()
conn.close()
return rows
if __name__ == '__main__':
user = 'xxx'
passwd = 'xxxx'
listener = 'xxx.xxx.xxx.xxx:1521/kiwi'
conn_list = user, passwd, listener
max_workers = 20
parallel = 10
workers = []
sql1 = """
select a.owner,a.table_name,a.column_name from dba_tab_columns a,(
select owner,object_name from dba_objects where owner in (
select name from sys.user$
where type#=1 and spare4 is not null
and datats# >3 and password is not null
and to_char(ctime,'yyyy-mm-dd') <>(select to_char(min(ctime),'yyyy-mm-dd') from sys.user$)
and name not in ('HR','OE','IX','SH','PM','BI')
)
and object_type='TABLE'
MINUS
select owner,table_name from dba_external_tables where owner in (
select name from sys.user$
where type#=1 and spare4 is not null
and datats# >3 and password is not null
and to_char(ctime,'yyyy-mm-dd') <>(select to_char(min(ctime),'yyyy-mm-dd') from sys.user$)
and name not in ('HR','OE','IX','SH','PM','BI')
)
) d
where a.owner=d.owner and a.table_name=d.object_name
and a.data_type in ('CHAR','VARCHAR2','NVARCHAR2')
and mod(a.data_length,2)=0
"""
rows = exec_ora_select(conn_list, sql1)
boss_q = queue.Queue()
if len(rows) != 0:
for row in rows:
boss_q.put(row)
pool = ThreadPool(max_workers)
for i in range(parallel):
worker = Worker(conn_list,boss_q,)
workers.append((worker,))
pool.starmap(lambda worker: worker.run(), workers)
pool.close()
pool.join()
「喜欢这篇文章,您的关注和赞赏是给作者最好的鼓励」
关注作者
【版权声明】本文为墨天轮用户原创内容,转载时必须标注文章的来源(墨天轮),文章链接,文章作者等基本信息,否则作者和墨天轮有权追究责任。如果您发现墨天轮中有涉嫌抄袭或者侵权的内容,欢迎发送邮件至:contact@modb.pro进行举报,并提供相关证据,一经查实,墨天轮将立刻删除相关内容。




