概要
本篇以实际案例为背景,介绍不同技术组件对数据建模的特点,并以ES为背景,介绍常用的联合查询的利弊,最后介绍了一下文件系统分词器path_hierarchy和嵌套对象的使用。
数据模型对比
实际项目中,电商平台系统常见的组合Java、Mysql和Elasticsearch,以基础的部门-员工实体为案例。
JavaBean类型定义
如果是JavaBean类型,会这样定义
public class Department {private Long id;private String name;private String desc;private List<Long> userIds;}public class Employee {private Long id;private String name;private byte gender;private Department dept;}
数据库模型定义
如果是关系型数据库(mysql),会这样建表
create table t_department (id bigint(20) not null auto_increment,name varchar(30) not null,desc varchar(80) not null,PRIMARY KEY (`id`))create table t_employee (id bigint(20) not null auto_increment,name varchar(30) not null,gender tinyint(1) not null,dept_id bigint(20),PRIMARY KEY (`id`))
依据数据库三范式设计表,每个实体设计成独立的表,用主外键约束进行关联,按照现有的数据表规范,已经不再使用外键约束了,外键约束放在应用层控制。
ES文档数据模型
如果es的文档数据模型,会这样设计document
{"deptId": 1,"deptname": "CEO办公室","desc":"这一个有情怀的CEO","employee":[{"userId":1,"name":Lily,"gender":0},{"userId":2,"name":Lucy,"gender":0},{"userId":3,"name":Tom,"gender":1}]}
es更类似面向对象的数据模型,将所有关联的数据放在一个document里。
JOIN查询
我们以博客网站为案例背景,建立博客网站中博客与用户的数据模型。
将用户与博客分别建立document,分割实体,类似数据库三范式,并使用关键field(userId)建立依赖关系
先建立两个实体document,放一条示例数据
PUT /blog/user/1{"id":1,"username":"Lily","age":18}PUT /website/article/1{"title":"my frist blog","content":"this is my first blog, thank you","userId":1}
需求:要查询用户名Lily发表的博客 步骤:1)查询用户document,根据名字Lily查询到它的userId;2)根据第一步查询返回的userId,重新组装请求报文,查询博客docuement 示例报文:
GET /blog/user/_search{"query": {"match": {"username.keyword": "Lily"}}}GET /website/article/_search{"query": {"constant_score": {"filter": {"terms": {"userId": ["1"]}}}}}
以上步骤叫做应用层Join实现关联查询
优点:结构清晰明了,数据不冗余,维护方便。 缺点:应用层join,如关联的数据过多,查询性能很低。
适用场景:两层join,第一层document查询基本上能做到精准查询,返回的结果数很少,并且第二层数据量特别大。如案例中的场景,根据名称找userId,返回的数据相对较少,第二层的查询性能就比较高,第二层数据属于业务数据类型,数据量肯定特别大。
适度冗余减少应用层Join查询
普通查询
接上面案例,修改博客document,将username冗余到该document中,如:
PUT /website/article/2{"title":"my second blog","content":"this is my second blog, thank you","userInfo": {"id":1,"username":"Lily"}}
查询时直接指定username:
GET /website/article/_search{"query": {"constant_score": {"filter": {"term": {"userInfo.username.keyword": "Lily"}}}}}
优点:一次查询即可,性能较高 缺点:若冗余的字段有更新,维护非常麻烦
适合场景:适当的冗余比较有必要,可以减小join查询,关系型数据库设计也经常有冗余数据的优化,只要挑选冗余字段时要注意尽量选变更可能性小的字段,避免查询一时爽,更新想撞墙这种事情发生。
数据冗余设计后聚合分组查询
造点测试数据进去
PUT /website/article/3{"title":"my third blog","content":"this is my third blog, thank you","userInfo": {"id":2,"username":"Lucy"}}PUT /website/article/4{"title":"my 4th blog","content":"this is my 4th blog, thank you","userInfo": {"id":2,"username":"Lucy"}}
分组查询:Lily发表了哪些博客,Lucy发表了哪些博客
GET website/article/_search{"size": 0,"aggs": {"group_by_username": {"terms": {"field": "userInfo.username.keyword"},"aggs": {"top_articles": {"top_hits": {"size": 10,"_source": {"includes": "title"}}}}}}}
文件搜索
文件类型的数据有个很大的特点:有目录层次关系。如果我们有对文件搜索的需求,可以这个建立索引:
PUT /files{"settings": {"analysis": {"analyzer": {"paths": {"tokenizer":"path_hierarchy"}}}}}PUT /files/_mapping/file{"properties": {"name": {"type": "keyword"},"path": {"type": "keyword","fields": {"tree": {"type": "text","analyzer": "paths"}}}}}
注意分词器path_hierarchy,会把/opt/data/log分成
/opt//opt/data//opt/data/log
插入一条测试数据
PUT /files/file/1{"name":"hello.txt","path":"/opt/data/txt/"}
搜索案例
指定文件名,指定具体路径搜索
GET files/file/_search{"query": {"bool": {"must": [{"match": {"name": "hello.txt"}},{"match": {"path": "/opt/data/txt/"}}]}}}
/opt路径下的hello.txt文件(包含子目录)
GET files/file/_search{"query": {"bool": {"must": [{"match": {"name": "hello.txt"}},{"match": {"path.tree": "/opt/"}}]}}}
区别:path与path.tree的使用 path.tree是会分词的,并且指定分词器为path_hierarchy path不分词,直接使用。
nested object数据类型
提出问题
用普通的object对象做数据冗余时,如果冗余的数据是一个数组集合类的,查询可能会出问题,例如:博客信息下面的评论,是一个集合类型
PUT /website/article/5{"title": "清茶豆奶发表的一篇技术帖子","content": "我是清茶豆奶,大家要不要考虑关注一下Java架构社区啊","tags": [ "IT技术", "Java架构社区" ],"comments": [{"name": "清茶","comment": "有什么干货没有啊?","age": 29,"stars": 4,"date": "2019-10-29"},{"name": "豆奶","comment": "我最喜欢研究技术,真好","age": 32,"stars": 5,"date": "2019-10-30"}]}
需求:查询被29岁的豆奶用户评论过的博客
GET /website/article/_search{"query": {"bool": {"must": [{"match": {"comments.name.keyword": "豆奶"}},{"match": {"comments.age": "29"}}]}}}
根据这条演示数据,这个条件是查不到结果的,但实际却查出来这条数据,为什么?
原因:object类型底层数据结果,会将json进行扁平化存储,如上例子,存储结构将变成:
{"title":["清茶","豆奶","发表","一篇","技术","帖子"],"content": ["我","清茶","豆奶","大家","要不要","考虑","关注","一下","Java架构社区"],tags:["IT技术", "Java架构社区"],comments.name:["清茶","豆奶"],comments.comment:["有","什么","干货","没有啊","我","最喜欢","研究","技术","真好"],comments.age:[29,32],comments.stars:[4,5],comments.date:["2019-10-29","2019-10-30"]}
这样"豆奶"和29就被命中了,跟预期的结果不一致。
解决办法
引入nested object类型,就可以解决这种问题。修改mapping,将comments的类型改成nested object。先删掉索引,再重新建立
PUT /website{"mappings": {"article": {"properties": {"comments": {"type": "nested","properties": {"name": {"type":"text"},"comment": {"type":"text"},"age": {"type":"short"},"stars": {"type":"short"},"date": {"type":"date"}}}}}}}
这样底层数据结构就成变成:
{"title":["清茶","豆奶","发表","一篇","技术","帖子"],"content": ["我","清茶","豆奶","大家","要不要","考虑","关注","一下","Java架构社区"],tags:["IT技术", "Java架构社区"],comments:[{"name":"清茶","comment":["有","什么","干货","没有啊"],"age":29,"stars":4,"date":"2019-10-29"},{"name":"豆奶","comment":["我","最喜欢","研究","技术","真好"],"age":32,"stars":5,"date":"2019-10-30"}]}
再查询结果为空,符合预期。
聚合查询示例
求博客每天评论的平均星数
GET /website/article/_search{"size": 0,"aggs": {"comments_path": {"nested": {"path": "comments"},"aggs": {"group_by_comments_date": {"date_histogram": {"field": "comments.date","interval": "day","format": "yyyy-MM-dd"},"aggs": {"stars_avg": {"avg": {"field": "comments.stars"}}}}}}}}
响应结果(有删节):
{"aggregations": {"comments_path": {"doc_count": 2,"group_by_comments_date": {"buckets": [{"key_as_string": "2019-10-29","key": 1572307200000,"doc_count": 1,"stars_avg": {"value": 4}},{"key_as_string": "2019-10-30","key": 1572393600000,"doc_count": 1,"stars_avg": {"value": 5}}]}}}}
小结
本篇以实际的案例为主,简单快速的介绍了实际项目中常用的数据联合查询,嵌套对象的使用等,很有实用价值,可以了解一下。




