Python Implementation: Chinese Person Relation Analysis with Bi-GRU Semantic Parsing
Introduction: Semantic parsing is an important aspect of natural language processing. At the word level, its basic task is word sense disambiguation; at the sentence level, semantic role labeling is the main concern; at the document level, coreference resolution and discourse-level semantic analysis are the focus.
Entity recognition and relation extraction are the foundation of higher-level NLP applications such as knowledge graph construction. Relation extraction can be understood simply as a classification problem: given two entities and the sentence in which they co-occur, determine the relation between the two entities.
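As a hedged illustration of this formulation (the entity names and sentence below are hypothetical placeholders, not from the training data), a single instance can be thought of as:

# One hypothetical relation-extraction instance: two entities, the sentence
# in which they co-occur, and the relation label to be predicted.
sample = {
    "entity1": "张三",                      # hypothetical person name
    "entity2": "李四",                      # hypothetical person name
    "sentence": "张三和李四于2005年结婚。",   # sentence mentioning both entities
    "relation": "夫妻",                     # one of the num_classes relation labels
}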
Deep learning methods based on CNNs or bidirectional RNNs with attention are currently regarded as the state-of-the-art solution for relation extraction. Most existing literature and code target English corpora and use word vectors as input for training. With a practical focus, this article introduces a Chinese relation extraction model built with a bidirectional GRU and dual attention at the character and sentence levels, using character embeddings (which naturally suit Chinese) as input and web-crawled data as the training corpus. The code is mainly developed on top of Tsinghua's open-source project thunlp/TensorFlow-NRE. The results are shown below:
I. Preparation Before the Experiment
The Python version we use is 3.6.5, and the modules used are as follows (a minimal import sketch is given after the list):
The tensorflow module is used to build the network, train the model, and save and restore it.
The numpy module handles matrix operations on the data.
The sklearn module is a collection of machine learning algorithms.
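A minimal environment sketch, assuming TensorFlow 1.x (the tf.contrib and tf.placeholder APIs used later only exist in 1.x); the exact TensorFlow and scikit-learn versions are not stated in the original article:

# Assumed environment for the code in this article (Python 3.6.5).
import numpy as np
import tensorflow as tf   # TensorFlow 1.x: the code relies on tf.contrib and tf.placeholder
import sklearn            # scikit-learn, used as a general machine learning utility library

print(tf.__version__)     # a 1.x version is assumed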
II. Building the Model Network
The network diagram of the model is as follows:
The idea of a bidirectional GRU with character-level attention comes from the paper "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification" [Zhou et al., 2016]. Here the LSTM in the original model is replaced with a GRU, and each Chinese character of a sentence is fed in as a character embedding. The model is trained on each input sentence with character-level attention applied.
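As a hedged NumPy sketch of what this character-level attention computes (an illustration of the Zhou et al. (2016) formulation, not the project's TensorFlow code, which appears below):

import numpy as np

def char_attention(H, w):
    # H: [num_steps, gru_size] Bi-GRU outputs for one sentence (forward and backward summed)
    # w: [gru_size, 1] attention vector (attention_omega in the code below)
    M = np.tanh(H)                                   # [num_steps, gru_size]
    scores = (M @ w).ravel()                         # [num_steps] attention logits
    alpha = np.exp(scores) / np.sum(np.exp(scores))  # softmax over the characters
    r = alpha @ H                                    # [gru_size] attention-weighted sentence vector
    return r, alpha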
The idea of sentence-level attention comes from the paper "Neural Relation Extraction with Selective Attention over Instances" [Lin et al., 2016]. The structure of the original model is shown below; here the CNN module that encodes each sentence is replaced with the bidirectional GRU model described above. The model is trained jointly over the sentences of each class, with sentence-level attention applied.
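Similarly, a hedged NumPy sketch of the selective attention over the bag of sentences belonging to one entity pair (illustrative only; the parameter names mirror attention_A and query_r in the code below):

import numpy as np

def bag_attention(S, A_diag, r):
    # S: [num_sentences, gru_size] encoded sentences of one entity pair (the "bag")
    # A_diag: [gru_size] diagonal of the attention matrix (attention_A)
    # r: [gru_size] relation query vector (query_r)
    e = S @ (A_diag * r)                    # [num_sentences] attention logits
    alpha = np.exp(e) / np.sum(np.exp(e))   # softmax over the sentences in the bag
    bag = alpha @ S                         # [gru_size] bag representation used for classification
    return bag, alpha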
Create a network.py file and define the vocabulary size, number of steps, number of classes, and so on:

class Settings(object):
    def __init__(self):
        self.vocab_size = 16691
        self.num_steps = 70
        self.num_epochs = 10
        self.num_classes = 12
        self.gru_size = 230
        self.keep_prob = 0.5
        self.num_layers = 1
        self.pos_size = 5
        self.pos_num = 123
        # the number of entity pairs of each batch during training or testing
        self.big_num = 50
Then build the GRU network. Following the network diagram given above, define the basic framework of the network, to be instantiated with the concrete parameters:
class GRU:
    def __init__(self, is_training, word_embeddings, settings):
        self.num_steps = num_steps = settings.num_steps
        self.vocab_size = vocab_size = settings.vocab_size
        self.num_classes = num_classes = settings.num_classes
        self.gru_size = gru_size = settings.gru_size
        self.big_num = big_num = settings.big_num

        self.input_word = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_word')
        self.input_pos1 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos1')
        self.input_pos2 = tf.placeholder(dtype=tf.int32, shape=[None, num_steps], name='input_pos2')
        self.input_y = tf.placeholder(dtype=tf.float32, shape=[None, num_classes], name='input_y')
        self.total_shape = tf.placeholder(dtype=tf.int32, shape=[big_num + 1], name='total_shape')
        total_num = self.total_shape[-1]

        word_embedding = tf.get_variable(initializer=word_embeddings, name='word_embedding')
        pos1_embedding = tf.get_variable('pos1_embedding', [settings.pos_num, settings.pos_size])
        pos2_embedding = tf.get_variable('pos2_embedding', [settings.pos_num, settings.pos_size])
        attention_w = tf.get_variable('attention_omega', [gru_size, 1])
        sen_a = tf.get_variable('attention_A', [gru_size])
        sen_r = tf.get_variable('query_r', [gru_size, 1])
        relation_embedding = tf.get_variable('relation_embedding', [self.num_classes, gru_size])
        sen_d = tf.get_variable('bias_d', [self.num_classes])

        gru_cell_forward = tf.contrib.rnn.GRUCell(gru_size)
        gru_cell_backward = tf.contrib.rnn.GRUCell(gru_size)

        if is_training and settings.keep_prob < 1:
            gru_cell_forward = tf.contrib.rnn.DropoutWrapper(gru_cell_forward, output_keep_prob=settings.keep_prob)
            gru_cell_backward = tf.contrib.rnn.DropoutWrapper(gru_cell_backward, output_keep_prob=settings.keep_prob)

        cell_forward = tf.contrib.rnn.MultiRNNCell([gru_cell_forward] * settings.num_layers)
        cell_backward = tf.contrib.rnn.MultiRNNCell([gru_cell_backward] * settings.num_layers)

        sen_repre = []
        sen_alpha = []
        sen_s = []
        sen_out = []
        self.prob = []
        self.predictions = []
        self.loss = []
        self.accuracy = []
        self.total_loss = 0.0

        self._initial_state_forward = cell_forward.zero_state(total_num, tf.float32)
        self._initial_state_backward = cell_backward.zero_state(total_num, tf.float32)

        # embedding layer
        inputs_forward = tf.concat(axis=2, values=[tf.nn.embedding_lookup(word_embedding, self.input_word),
                                                   tf.nn.embedding_lookup(pos1_embedding, self.input_pos1),
                                                   tf.nn.embedding_lookup(pos2_embedding, self.input_pos2)])
        inputs_backward = tf.concat(axis=2, values=[tf.nn.embedding_lookup(word_embedding, tf.reverse(self.input_word, [1])),
                                                    tf.nn.embedding_lookup(pos1_embedding, tf.reverse(self.input_pos1, [1])),
                                                    tf.nn.embedding_lookup(pos2_embedding, tf.reverse(self.input_pos2, [1]))])

        outputs_forward = []
        state_forward = self._initial_state_forward

        # Bi-GRU layer
        with tf.variable_scope('GRU_FORWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_forward, state_forward) = cell_forward(inputs_forward[:, step, :], state_forward)
                outputs_forward.append(cell_output_forward)

        outputs_backward = []
        state_backward = self._initial_state_backward
        with tf.variable_scope('GRU_BACKWARD') as scope:
            for step in range(num_steps):
                if step > 0:
                    scope.reuse_variables()
                (cell_output_backward, state_backward) = cell_backward(inputs_backward[:, step, :], state_backward)
                outputs_backward.append(cell_output_backward)

        output_forward = tf.reshape(tf.concat(axis=1, values=outputs_forward), [total_num, num_steps, gru_size])
        output_backward = tf.reverse(
            tf.reshape(tf.concat(axis=1, values=outputs_backward), [total_num, num_steps, gru_size]), [1])

        # word-level attention layer
        output_h = tf.add(output_forward, output_backward)
        attention_r = tf.reshape(tf.matmul(tf.reshape(tf.nn.softmax(
            tf.reshape(tf.matmul(tf.reshape(tf.tanh(output_h), [total_num * num_steps, gru_size]), attention_w),
                       [total_num, num_steps])), [total_num, 1, num_steps]), output_h), [total_num, gru_size])
III. Training and Using the Model
Regarding the training corpus: publicly available corpora for Chinese relation extraction are scarce, so we take inspiration from distant supervision. The idea is to first find entity pairs with a known relation and then collect the sentences in which both entities co-occur as positive samples. Negative samples are generated from entity pairs randomly drawn from the entity pool that have no relation, again collecting the sentences in which both entities co-occur.
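A hedged sketch of this corpus construction; get_cooccurrence_sentences and entity_pool are hypothetical helpers standing in for the web crawler and entity library, not part of the project code:

import random

def build_corpus(related_pairs, entity_pool, get_cooccurrence_sentences, num_negative_pairs):
    samples = []
    # positive samples: sentences in which a known related entity pair co-occurs
    for e1, e2, relation in related_pairs:
        for sent in get_cooccurrence_sentences(e1, e2):
            samples.append((e1, e2, sent, relation))
    # negative samples: randomly drawn entity pairs assumed to have no relation
    known = {(e1, e2) for e1, e2, _ in related_pairs}
    for _ in range(num_negative_pairs):
        e1, e2 = random.sample(entity_pool, 2)
        if (e1, e2) not in known:
            for sent in get_cooccurrence_sentences(e1, e2):
                samples.append((e1, e2, sent, 'NA'))   # 'NA' = no relation (assumed label name)
    return samples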
The entity pairs with known relations come from Fudan Knowledge Works (复旦知识工厂); thanks to them for providing the free API! One small issue is that the same relation label may be annotated differently in Fudan Knowledge Works. For example, for the relation "夫妻" (spouse), the crawled data may contain "丈夫" (husband), "妻子" (wife), "伉俪" (married couple), and so on, which need to be aligned manually.
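A hedged sketch of this manual alignment; the three entries come from the examples in the text, and the rest of the mapping would be filled in by hand:

# Map raw labels returned by the API to the unified relation labels used for training.
LABEL_ALIGN = {
    '丈夫': '夫妻',
    '妻子': '夫妻',
    '伉俪': '夫妻',
    # ... remaining raw labels aligned manually
}

def normalize_label(raw_label):
    return LABEL_ALIGN.get(raw_label, raw_label)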
1. Model training:
Create a train_GRU file and train the model on the preprocessed .npy files.
The training data is as follows:
The code is as follows:
def main(_):
    # the path to save models
    save_path = './model/'

    print('reading word embedding')
    wordembedding = np.load('./data/vec.npy')

    print('reading training data')
    train_y = np.load('./data/train_y.npy')
    train_word = np.load('./data/train_word.npy')
    train_pos1 = np.load('./data/train_pos1.npy')
    train_pos2 = np.load('./data/train_pos2.npy')

    settings = network.Settings()
    settings.vocab_size = len(wordembedding)
    settings.num_classes = len(train_y[0])

    big_num = settings.big_num

    with tf.Graph().as_default():
        sess = tf.Session()
        with sess.as_default():
            initializer = tf.contrib.layers.xavier_initializer()
            with tf.variable_scope("model", reuse=None, initializer=initializer):
                m = network.GRU(is_training=True, word_embeddings=wordembedding, settings=settings)
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(0.0005)
            train_op = optimizer.minimize(m.final_loss, global_step=global_step)
            sess.run(tf.global_variables_initializer())
            saver = tf.train.Saver(max_to_keep=None)

            merged_summary = tf.summary.merge_all()
            summary_writer = tf.summary.FileWriter(FLAGS.summary_dir + '/train_loss', sess.graph)

            def train_step(word_batch, pos1_batch, pos2_batch, y_batch, big_num):
                feed_dict = {}
                total_shape = []
                total_num = 0
                total_word = []
                total_pos1 = []
                total_pos2 = []
                for i in range(len(word_batch)):
                    total_shape.append(total_num)
                    total_num += len(word_batch[i])
                    for word in word_batch[i]:
                        total_word.append(word)
                    for pos1 in pos1_batch[i]:
                        total_pos1.append(pos1)
                    for pos2 in pos2_batch[i]:
                        total_pos2.append(pos2)
                total_shape.append(total_num)
                total_shape = np.array(total_shape)
                total_word = np.array(total_word)
                total_pos1 = np.array(total_pos1)
                total_pos2 = np.array(total_pos2)

                feed_dict[m.total_shape] = total_shape
                feed_dict[m.input_word] = total_word
                feed_dict[m.input_pos1] = total_pos1
                feed_dict[m.input_pos2] = total_pos2
                feed_dict[m.input_y] = y_batch

                temp, step, loss, accuracy, summary, l2_loss, final_loss = sess.run(
                    [train_op, global_step, m.total_loss, m.accuracy, merged_summary, m.l2_loss, m.final_loss],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                accuracy = np.reshape(np.array(accuracy), (big_num))
                acc = np.mean(accuracy)
                summary_writer.add_summary(summary, step)

                if step % 50 == 0:
                    tempstr = "{}: step {}, softmax_loss {:g}, acc {:g}".format(time_str, step, loss, acc)
                    print(tempstr)

            for one_epoch in range(settings.num_epochs):
                temp_order = list(range(len(train_word)))
                np.random.shuffle(temp_order)
                for i in range(int(len(temp_order) / float(settings.big_num))):
                    temp_word = []
                    temp_pos1 = []
                    temp_pos2 = []
                    temp_y = []

                    temp_input = temp_order[i * settings.big_num:(i + 1) * settings.big_num]
                    for k in temp_input:
                        temp_word.append(train_word[k])
                        temp_pos1.append(train_pos1[k])
                        temp_pos2.append(train_pos2[k])
                        temp_y.append(train_y[k])
                    num = 0
                    for single_word in temp_word:
                        num += len(single_word)

                    if num > 1500:
                        print('out of range')
                        continue

                    temp_word = np.array(temp_word)
                    temp_pos1 = np.array(temp_pos1)
                    temp_pos2 = np.array(temp_pos2)
                    temp_y = np.array(temp_y)

                    train_step(temp_word, temp_pos1, temp_pos2, temp_y, settings.big_num)

                    current_step = tf.train.global_step(sess, global_step)
                    if current_step > 8000 and current_step % 100 == 0:
                        print('saving model')
                        path = saver.save(sess, save_path + 'ATT_GRU_model', global_step=current_step)
                        tempstr = 'have saved model to ' + path
                        print(tempstr)
Training process:
2. Model testing:
The trained model obtained after training is as follows:
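The test script below calls a pos_embed helper (not shown in the article) to map a character's position relative to each entity into the position-embedding vocabulary; a hedged sketch consistent with the (-60~+60) comment and pos_num = 123 in the settings might be:

def pos_embed(x):
    # Clip the relative position to [-60, 60] and shift it to a non-negative index
    # in [0, 122], matching pos_num = 123 in network.Settings (a sketch, not the original code).
    if x < -60:
        return 0
    if x > 60:
        return 122
    return x + 61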
while True:
    # try:
    # BUG: Encoding error if user input directly from commandline.
    line = input('请输入中文句子,格式为"name1 name2 sentence":')

    # Read file from test file
    '''
    infile = open('test.txt', encoding='utf-8')
    line = ''
    for orgline in infile:
        line = orgline.strip()
        break
    infile.close()
    '''

    en1, en2, sentence = line.strip().split()
    print("实体1: " + en1)
    print("实体2: " + en2)
    print(sentence)
    relation = 0
    en1pos = sentence.find(en1)
    if en1pos == -1:
        en1pos = 0
    en2pos = sentence.find(en2)
    if en2pos == -1:
        en2pos = 0
    output = []
    # length of sentence is 70
    fixlen = 70
    # max length of position embedding is 60 (-60~+60)
    maxlen = 60

    # Encoding test x
    for i in range(fixlen):
        word = word2id['BLANK']
        rel_e1 = pos_embed(i - en1pos)
        rel_e2 = pos_embed(i - en2pos)
        output.append([word, rel_e1, rel_e2])

    for i in range(min(fixlen, len(sentence))):
        word = 0
        if sentence[i] not in word2id:
            word = word2id['UNK']
        else:
            word = word2id[sentence[i]]
        output[i][0] = word

    test_x = []
    test_x.append([output])

    # Encoding test y
    label = [0 for i in range(len(relation2id))]
    label[0] = 1
    test_y = []
    test_y.append(label)

    test_x = np.array(test_x)
    test_y = np.array(test_y)

    test_word = []
    test_pos1 = []
    test_pos2 = []

    for i in range(len(test_x)):
        word = []
        pos1 = []
        pos2 = []
        for j in test_x[i]:
            temp_word = []
            temp_pos1 = []
            temp_pos2 = []
            for k in j:
                temp_word.append(k[0])
                temp_pos1.append(k[1])
                temp_pos2.append(k[2])
            word.append(temp_word)
            pos1.append(temp_pos1)
            pos2.append(temp_pos2)
        test_word.append(word)
        test_pos1.append(pos1)
        test_pos2.append(pos2)

    test_word = np.array(test_word)
    test_pos1 = np.array(test_pos1)
    test_pos2 = np.array(test_pos2)

    prob, accuracy = test_step(test_word, test_pos1, test_pos2, test_y)
    prob = np.reshape(np.array(prob), (1, test_settings.num_classes))[0]
    print("关系是:")
    top3_id = prob.argsort()[-3:][::-1]
    for n, rel_id in enumerate(top3_id):
        print("No." + str(n + 1) + ": " + id2relation[rel_id] + ", Probability is " + str(prob[rel_id]))

Test results:
Complete code: https://gitcode.net/qq_42279468/python-bi-gru.git