1. First, define an index as follows:
PUT /person_news
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "0",
      "max_result_window": "2000000000"
    }
  },
  "mappings": {
    "properties": {
      "companyName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
      },
      "newsSource": {"type": "keyword"},
      "newsContent": {"type": "text", "analyzer": "ik_max_word"},
      "newsTitle": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}
      },
      "labels": {"type": "keyword"},
      "personInfo": {
        "type": "nested",
        "properties": {
          "personName": {"type": "keyword"},
          "age": {"type": "integer"}
        }
      },
      "hotPoint": {"type": "long"}
    }
  }
}
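person_news is an index of news items and the people they mention. companyName is the company name: a text field analyzed with the IK analyzer (ik_max_word), plus a keyword sub-field that is not analyzed and can be used for aggregations and exact matching;
newsSource is the news source, not analyzed;
newsContent is the news body, analyzed;
newsTitle is the news title, analyzed, with a keyword sub-field (same as companyName);
labels holds the tags, not analyzed. I plan to store an array in this field, since one news item can carry several tags (see the inserted documents below);
personInfo holds the people mentioned in the news. It uses the nested type and is an array of objects, each with a personName and an age field;
hotPoint is the popularity score of the news item, typically used to sort news.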
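To see why personInfo is mapped as nested rather than as a plain object array, here is a small sketch in plain Python. It does not talk to Elasticsearch; the helper names (flatten, match_flattened, match_nested) are made up for illustration and only imitate the indexing behavior: a plain object array is flattened into one value list per leaf field, so the link between a name and its age is lost.

```python
from typing import Dict, List

# The personInfo array of document 1 in this article.
person_info = [
    {"personName": "许家印", "age": 60},
    {"personName": "夏海钧", "age": 59},
]


def flatten(objects: List[dict]) -> Dict[str, list]:
    """Imitate a plain `object` field: one value list per leaf field,
    so the pairing between personName and age is lost."""
    flat: Dict[str, list] = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"personInfo.{key}", []).append(value)
    return flat


def match_flattened(objects: List[dict], name: str, age: int) -> bool:
    """Each condition is checked against the whole flattened value list."""
    flat = flatten(objects)
    return name in flat["personInfo.personName"] and age in flat["personInfo.age"]


def match_nested(objects: List[dict], name: str, age: int) -> bool:
    """Both conditions must hold inside the same object, as with `nested`."""
    return any(o["personName"] == name and o["age"] == age for o in objects)


# 许家印 is 60, not 59: only the flattened view reports a (false) match.
print(match_flattened(person_info, "许家印", 59))  # True
print(match_nested(person_info, "许家印", 59))     # False
```

With the nested type each array element is indexed as its own hidden document, which is what keeps the two conditions tied to the same person.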
2. Insert data
PUT person_news/_doc/1
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日中国证监会对中国恒大董事长许家印罚款4000万,并对其做出终身不能入市的处罚规定,其公司其他高管夏海钧也被做出相应处罚",
  "newsTitle": "恒大许家印被罚",
  "labels": ["恒大", "许家印"],
  "personInfo": [
    {"personName": "许家印", "age": 60},
    {"personName": "夏海钧", "age": 59}
  ],
  "hotPoint": 1
}
PUT person_news/_doc/2
{
  "companyName": "阿里巴巴有限责任公司",
  "newsSource": "新华社",
  "newsContent": "今日阿里公司集团董事长张勇卸任,由蔡崇信接任",
  "newsTitle": "阿里张勇卸任",
  "labels": ["阿里", "蔡崇信", "张勇"],
  "personInfo": [
    {"personName": "张勇", "age": 60},
    {"personName": "蔡崇信", "age": 54}
  ],
  "hotPoint": 2
}
PUT person_news/_doc/3
{
  "companyName": "中国恒大有限责任公司",
  "newsSource": "路透社",
  "newsContent": "中国恒大董事长传闻跳楼,恒大资产负债高达几万亿,传闻阿里张勇将对恒大进行投资,进军房地产,具体消息恒大高管夏海钧予以否认",
  "newsTitle": "恒大董事长许家印",
  "labels": ["恒大", "张勇"],
  "personInfo": [
    {"personName": "张勇", "age": 54},
    {"personName": "夏海钧", "age": 59}
  ],
  "hotPoint": 3
}
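If you would rather load the three documents in one round trip instead of three PUT calls, the _bulk API takes a newline-delimited body: one action line followed by one source line per document. The sketch below only builds that body as a string. The helper name build_bulk_body is my own, and to keep it short I include just three of the fields from the documents above.

```python
import json

# A subset of the fields from the three documents above, keyed by document id.
docs = {
    "1": {"newsTitle": "恒大许家印被罚", "newsSource": "新华社", "hotPoint": 1},
    "2": {"newsTitle": "阿里张勇卸任", "newsSource": "新华社", "hotPoint": 2},
    "3": {"newsTitle": "恒大董事长许家印", "newsSource": "路透社", "hotPoint": 3},
}


def build_bulk_body(index: str, documents: dict) -> str:
    """Build an NDJSON body for the _bulk API (hypothetical helper)."""
    lines = []
    for doc_id, source in documents.items():
        # Action line: index this document under the given id.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself, keeping Chinese text readable.
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("person_news", docs)
print(body)
```

The resulting string would be sent as the body of POST /_bulk with the Content-Type header set to application/x-ndjson.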
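3. In Kibana you can use a DSL request to see how a given analyzer tokenizes a piece of text (here ik_max_word, which splits the text at the finest granularity and emits as many tokens as possible)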
GET /person_news/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中国恒大有限责任公司"
}
The result is as follows:
{
  "tokens" : [
    {"token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0},
    {"token" : "恒", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1},
    {"token" : "大有", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 2},
    {"token" : "有限责任", "start_offset" : 4, "end_offset" : 8, "type" : "CN_WORD", "position" : 3},
    {"token" : "有限", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4},
    {"token" : "责任", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 5},
    {"token" : "公司", "start_offset" : 8, "end_offset" : 10, "type" : "CN_WORD", "position" : 6}
  ]
}
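The same _analyze call can also be issued outside Kibana. Below is a minimal sketch using only the Python standard library. The address http://localhost:9200 and an unsecured cluster are assumptions on my part, and the function names are hypothetical. The _analyze endpoint accepts POST as well as GET, so the sketch uses POST to carry the JSON body.

```python
import json
import urllib.request

# Assumption: a local, unsecured cluster. Adjust to your environment.
ES_URL = "http://localhost:9200"


def build_analyze_request(index: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) an _analyze request for the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(index: str, analyzer: str, text: str) -> list:
    """Send the request and return the token strings (needs a live cluster)."""
    with urllib.request.urlopen(build_analyze_request(index, analyzer, text)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


request = build_analyze_request("person_news", "ik_max_word", "中国恒大有限责任公司")
print(request.get_method(), request.full_url)
```

Calling analyze("person_news", "ik_max_word", "中国恒大有限责任公司") against a cluster with the IK plugin installed should return the token strings shown in the response above.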
With ik_smart, which splits at the coarsest granularity ("smart" segmentation), the result is:
{
  "tokens" : [
    {"token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0},
    {"token" : "恒", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1},
    {"token" : "大", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 2},
    {"token" : "有限责任", "start_offset" : 4, "end_offset" : 8, "type" : "CN_WORD", "position" : 3},
    {"token" : "公司", "start_offset" : 8, "end_offset" : 10, "type" : "CN_WORD", "position" : 4}
  ]
}
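To make the difference between the two IK modes concrete, the sketch below compares the two token lists returned above, copied as (token, start_offset, end_offset) tuples. The overlapping helper is my own. It shows that ik_max_word emits tokens whose character ranges overlap, while ik_smart yields a non-overlapping segmentation for this text.

```python
# (token, start_offset, end_offset), copied from the two responses above.
ik_max_word = [
    ("中国", 0, 2), ("恒", 2, 3), ("大有", 3, 5), ("有限责任", 4, 8),
    ("有限", 4, 6), ("责任", 6, 8), ("公司", 8, 10),
]
ik_smart = [
    ("中国", 0, 2), ("恒", 2, 3), ("大", 3, 4), ("有限责任", 4, 8), ("公司", 8, 10),
]


def overlapping(tokens):
    """Return the pairs of tokens whose character ranges overlap."""
    pairs = []
    for i, (a, a_start, a_end) in enumerate(tokens):
        for b, b_start, b_end in tokens[i + 1:]:
            if a_start < b_end and b_start < a_end:
                pairs.append((a, b))
    return pairs


max_only = {t[0] for t in ik_max_word} - {t[0] for t in ik_smart}
smart_only = {t[0] for t in ik_smart} - {t[0] for t in ik_max_word}

print("only in ik_max_word:", sorted(max_only))
print("only in ik_smart:", sorted(smart_only))
print("overlaps in ik_max_word:", overlapping(ik_max_word))
print("overlaps in ik_smart:", overlapping(ik_smart))
```

The extra overlapping tokens are why ik_max_word is commonly chosen at index time when recall matters, as in the mapping of this article.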
With Elasticsearch's built-in default analyzer (standard), the result is as follows (every Chinese character becomes a separate token)
GET /person_news/_analyze
{
  "analyzer": "standard",
  "text": "中国恒大有限责任公司"
}
{
  "tokens" : [
    {"token" : "中", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0},
    {"token" : "国", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1},
    {"token" : "恒", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2},
    {"token" : "大", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3},
    {"token" : "有", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4},
    {"token" : "限", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5},
    {"token" : "责", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 6},
    {"token" : "任", "start_offset" : 7, "end_offset" : 8, "type" : "<IDEOGRAPHIC>", "position" : 7},
    {"token" : "公", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 8},
    {"token" : "司", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 9}
  ]
}
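The shape of this output is easy to imitate: for text made up only of Chinese characters, every character becomes one token and its position equals its offset. The sketch below reproduces that shape in plain Python. It is only an imitation of the observed output, not the analyzer's real algorithm (the standard analyzer follows Unicode text segmentation rules), and the function name per_character_tokens is made up.

```python
def per_character_tokens(text: str) -> list:
    """Imitate the standard analyzer's output for pure-Han text:
    one token per character. Other input is rejected on purpose,
    because the real analyzer treats it differently."""
    tokens = []
    for position, char in enumerate(text):
        if not "\u4e00" <= char <= "\u9fff":
            raise ValueError(f"not a CJK unified ideograph: {char!r}")
        tokens.append({
            "token": char,
            "start_offset": position,
            "end_offset": position + 1,
            "type": "<IDEOGRAPHIC>",
            "position": position,
        })
    return tokens


tokens = per_character_tokens("中国恒大有限责任公司")
print([t["token"] for t in tokens])
```

Since single characters carry little meaning on their own, a search on a field analyzed this way matches any document sharing individual characters with the query, which is the motivation for using IK on the text fields of this index.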