SIT742: About Assignment A1

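Ugh.. I have zero background, so how did I end up picking a Python course...

A1 for SIT742 is basically data preprocessing, for both numerical and text data, but because the data we were given is the Kaggle wine reviews set (link: https://www.kaggle.com/zynicide/wine-reviews), it feels.. a bit tricky to handle..

The pressure is real _(:з」∠)_

First, getting the data.. the lecturer put it on GitHub, so this part is easy..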

!pip install wget
import wget

# download the wine reviews and the stop-word list into the working directory
# (separate names, so the second download does not overwrite the first result)
wine_path = wget.download('https://github.com/tulip-lab/sit742/raw/master/Assessment/2019/data/wine.json')
stopwords_path = wget.download('https://github.com/tulip-lab/sit742/raw/master/Assessment/2019/data/stopwords.txt')

Then the usual module imports

import json as js
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file = 'wine.json'

Then... since the original data is in JSON format... I took a quick look at the structure with jq, worked out the column names, and carried on..

import json
from pandas import DataFrame

with open("wine.json") as f:
    wine=json.load(f)
type(wine)
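# honestly not sure whether the lines above do anything... (they do: `wine` feeds the DataFrame below; type(wine) is only a check)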

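df = DataFrame(wine, columns = ['points','title','description','taster_name','taster_twitter_handle','price','designation','variety','region_1','region_2','province','country','winery'])
# build the DataFrame first
df1 = df.dropna(subset=['points', 'price', 'variety', 'country']).copy()
# then drop rows with missing values in certain columns.. if things like origin or price are missing, the row is useless anyway
# (.copy() so the column assignment below does not trigger SettingWithCopyWarning)
df1.shape[0] - df1.count()
# missing values left in each column
df1['points'] = df1['points'].astype('float')
# points is odd.. clearly numeric but read in as object.. so it has to be converted
df1.dtypes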
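
A small sketch of that dtype fix on made-up data, since it confused me for a while. The toy frame is an assumption, not the real wine data. When numbers arrive as strings, pandas does not treat the column as numeric; `pd.to_numeric` with `errors="coerce"` is an alternative to `astype('float')` that turns unparseable values into NaN instead of raising.

```python
import pandas as pd

# Toy data (hypothetical): points stored as strings, one of them malformed.
toy = pd.DataFrame({"points": ["87", "90", "n/a"], "price": [15.0, 30.0, 22.0]})

# Before conversion the column holds strings, not numbers.
print(toy["points"].dtype)

# errors="coerce" converts what it can and marks the rest as NaN.
toy["points"] = pd.to_numeric(toy["points"], errors="coerce")

print(toy["points"].dtype)
print(toy["points"].isna().sum())  # number of values that could not be parsed
```

Whether the real points column contains malformed values I have not checked; if it does not, `astype('float')` does the same job.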

That's the prep work done. Next is Part 1, the numerical analysis (which is really just preprocessing...)

1.1 Explore the data distribution for each column.

# write your code here
# you may use functions such as describe() on each attribute.
df1.describe()
# the result only has statistics for points and price.. will fix this later
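
A possible way to cover the text columns too, sketched on a toy frame (the data below is made up). By default `describe()` only summarises numeric columns; passing `include="all"`, or calling `describe()` on a single text column, gives count / unique / top / freq instead.

```python
import pandas as pd

# Toy data (hypothetical), with one text column next to the numeric ones.
toy = pd.DataFrame({
    "points": [87.0, 90.0, 92.0, 88.0],
    "price": [15.0, 30.0, 65.0, 12.0],
    "variety": ["Pinot Noir", "Shiraz", "Pinot Noir", "Riesling"],
})

numeric_only = toy.describe()                # numeric columns only
everything = toy.describe(include="all")     # every column, NaN where a statistic does not apply
text_only = toy["variety"].describe()        # count, unique, top, freq

print(everything)
print(text_only)
```

On the real data the same idea would be `df1.describe(include='all')`.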

1.2 Find the 10 varieties of wine which receive the highest number of reviews

# write your code here
# you may use functions such as value_counts()  
df1['variety'].value_counts().head(10)

1.3 Find varieties of wine having the average price less than 20, with the average points at least 90

# write your code here
# you may use functions such as groupby() 
## groupby variety -> calculate mean points/price -> select by 20/90
df1.dtypes
df2 = df1.groupby('variety').agg({'points':'mean', 'price':'mean'})
# first group the wines by variety with groupby, then chain agg to compute the means...
df2[(df2["points"]>=90) & (df2["price"]<20)]
# then filter by the conditions... (>= because the task says "at least 90")

1.4 Build statistic table

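# the requirement is really just 4 columns: country, the most popular variety in that country, that variety's average points, and its average price
# write your code here
# you may use functions such as groupby() and round(decimals=2)
def top_value_count(x, n=1):
    return x.value_counts().head(n)
# did I even use this... _(:з」∠)_  (no, it is never called below)

df3 = df1.groupby('country').agg({'points':'mean', 'price':'mean'}).round(decimals=2)
# for the average points and price, same trick again: groupby and compute.. with decimals=2 to keep two decimal places
# note: these are per-country averages; for the averages of the top variety itself, group by ['country', 'variety'] instead

se1 = df1.groupby('country')['variety'].value_counts()
se2 = se1.groupby('country').head(1)
# this bit is kind of hacky.. rank the varieties within each country by count, then use head(1) to keep only the first row per country..
df4 = se2.to_frame(name='n_reviews')
df5 = df4.drop(['n_reviews'], axis=1)
# after that the result is a Series indexed by (country, variety), so turn it back into a DataFrame and drop the count column...
# (naming the count column explicitly keeps this working across pandas versions)

df6 = df5.join(df3.reindex(df5.index, level=0))
df6.index.names = ['Country', 'Variety']
df6.rename(columns={'points':'AvgPoints', 'price':'AvgPrice'}, inplace = True)
df6
# finally join country+variety (df5) with the average points/price (df3), fix the names, and call it done... _(:з」∠)_
# country and variety live in the index, not in the columns, so they are renamed through index.names
# save your table to 'statisticByState.csv'
df6.to_csv('statisticByState.csv', 
            encoding='utf-8', 
            index=True, 
            header=True)
!dir
# and finally write it to a file... (!dir is the Windows listing command; !ls elsewhere)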
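
If the task really means the averages of the top variety itself rather than of the whole country, a sketch like the following might do it. This is my reading of the requirement, not a confirmed solution; the toy data and the helper column name `reviews` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical).
toy = pd.DataFrame({
    "country": ["AU", "AU", "AU", "FR", "FR", "FR"],
    "variety": ["Shiraz", "Shiraz", "Riesling", "Pinot Noir", "Pinot Noir", "Chardonnay"],
    "points": [90.0, 92.0, 85.0, 88.0, 91.0, 80.0],
    "price": [20.0, 30.0, 10.0, 40.0, 50.0, 15.0],
})

# Review count, mean points and mean price for every (country, variety) pair.
stats = (
    toy.groupby(["country", "variety"])
       .agg(reviews=("points", "size"),
            AvgPoints=("points", "mean"),
            AvgPrice=("price", "mean"))
       .round(2)
       .reset_index()
)

# Keep only the most-reviewed variety of each country.
top = (
    stats.sort_values(["country", "reviews"], ascending=[True, False])
         .drop_duplicates("country")
         .drop(columns="reviews")
         .rename(columns={"country": "Country", "variety": "Variety"})
         .reset_index(drop=True)
)
print(top)
```

Here everything stays in ordinary columns, so no join or reindex across index levels is needed. Ties between varieties are broken by whichever row sorts first.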

That wraps up Part 1, the numerical preprocessing/analysis. Next is Part 2, the text processing..

2.1 Extract high-frequency words in description

import re
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import *
from itertools import chain
#from tqdm import tqdm
import codecs
from nltk.tokenize import word_tokenize as WordTokenizer
nltk.download('wordnet')
# honestly not sure which of these are actually needed... _(:з」∠)_
with open('stopwords.txt') as f:
    stop_words = f.read().splitlines()
stop_words = set(stop_words)
# preprocess the description column only
df10 = df1[['description']]

Now comes the real text preprocessing...

# write your code here
# define your tokenize
# tokenization->lower case->stop words->stemming/lemma->Sentence Segmentation
# transform case->tokenize->stem->stop_word->filter length/sentence...
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?") 
tokens = df10['description'].apply(tokenizer.tokenize)
# step 1, tokenization: split each sentence into words; the regular expression was given by the lecturer.. _(:з」∠)_

tokens = [k.lower() for l in tokens for k in l]
# step 2, lowercase: upper case to lower case (this also flattens everything into one long list)

tokens = [word for word in tokens if word not in stop_words]
# step 3, filter stop words, i.e. throw out things like "a" and "the"...

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(w) for w in tokens]
# step 4, lemmatization (the result has to be assigned back, otherwise tokens stays unchanged); stemming also exists, but I heard this one is better...

tokens = [word for word in tokens if word.isalpha()]
# last step: filter out the numbers and such that tokenization did not remove..
tokens
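
To see what the steps above do to a single sentence, here is a minimal sketch using only the standard library. `re.findall` stands in for `RegexpTokenizer` (as far as I can tell it behaves the same for this pattern), the stop-word set is a made-up stand-in for stopwords.txt, and lemmatization is left out so nothing needs downloading.

```python
import re

PATTERN = re.compile(r"\w+(?:[-']\w+)?")      # same pattern as in the post
toy_stop_words = {"a", "the", "with", "and"}  # hypothetical stand-in for stopwords.txt

def preprocess(text, stop_words):
    """Tokenize, lowercase and drop stop words (no lemmatization in this sketch)."""
    tokens = PATTERN.findall(text)
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop_words]

raw = preprocess("It's a fruit-forward Shiraz with 14 percent alcohol", toy_stop_words)
alpha_only = [t for t in raw if t.isalpha()]

print(raw)
print(alpha_only)
```

One thing this makes visible: the pattern deliberately keeps words like "it's" and "fruit-forward" in one piece, but the final `isalpha()` filter then removes them together with the numbers, because the apostrophe and the hyphen are not letters. Whether that is wanted depends on the task.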

Then find out which words are high-frequency (appearing more than 5000 times) and export them to a txt file...

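# find top common words with document frequencies > 5000
# you may use function FreqDist() and sort()
fdist = FreqDist(tokens)
list1 = sorted(filter(lambda x: x[1] > 5000, fdist.items()), key=lambda x: x[1], reverse=True)
# count with FreqDist first, filter with filter, then sort by count... a lambda is a small function made on the spot, e.g. to set a condition like this...
# note: tokens is one flat list, so this counts total occurrences, not the number of descriptions containing the word

# save your table to 'top_common_words.txt'  
import codecs
with codecs.open('top_common_words.txt', 'w', encoding='utf-8') as f:
    f.write(str(list1) + '\r\n')
# wrote the text with codecs.. still looks a bit ugly _(:з」∠)_  (the with block makes sure the file gets closed)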
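
The template comment says "document frequencies", which is not quite the same thing as counting every occurrence. A small sketch of the difference with `collections.Counter`, on made-up token lists (one list per description); the threshold of 1 is a toy value standing in for 5000.

```python
from collections import Counter

# Toy corpus (hypothetical): one token list per description.
docs = [
    ["cherry", "cherry", "oak", "tannin"],
    ["cherry", "plum"],
    ["oak", "oak", "oak"],
]

# Term frequency: every occurrence counts.
term_freq = Counter(tok for doc in docs for tok in doc)

# Document frequency: each word counts at most once per description.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

threshold = 1  # toy value; the assignment uses 5000
common = sorted(
    ((word, n) for word, n in doc_freq.items() if n > threshold),
    key=lambda item: (-item[1], item[0]),  # highest count first, then alphabetical
)

print(term_freq)
print(doc_freq)
print(common)
```

To do this on the real data the per-description token lists would have to be kept, rather than flattened into one list as in my step 2.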

2.2 Find key words for describing Shiraz using TF-IDF

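The legendary feature generation, with the all-time classic tf-idf

(Of course this one comes with a condition: we only look at how the words relate to the Shiraz variety.. so some preprocessing is still needed..)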

# select 'description' from 'variety' equal to 'Shiraz'
# (== keeps exact matches only; str.match('Shiraz') would also keep any variety name that merely starts with 'Shiraz')
df22 = df1.loc[df1['variety'] == 'Shiraz', ['description']]

# same pipeline as in 2.1: tokenize -> lowercase -> stop words -> lemmatize -> letters only
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?") 
tokens20 = df22['description'].apply(tokenizer.tokenize)

tokens20 = [k.lower() for l in tokens20 for k in l]

tokens20 = [word for word in tokens20 if word not in stop_words]

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
tokens20 = [lemmatizer.lemmatize(w) for w in tokens20]
# assign the lemmas back, otherwise tokens20 stays unchanged

tokens20 = [word for word in tokens20 if word.isalpha()]
tokens20

Then on to the tf-idf itself...

# use TfidfVectorizer to calculate TF-IDF score
str20 = " ".join(tokens20)
str20 = [str20]
# note: this turns all the Shiraz text into one single document, so the idf part is the same for every word

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer = "word")
tfs = tfidf.fit_transform(str20)

vocab = tfidf.get_feature_names_out()
# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() replaces it

import pandas as pd
df30 = pd.DataFrame({'word': vocab, 'weight': tfs.toarray()[0]})
# build the table in one go; DataFrame.append in a loop is slow and no longer exists in pandas 2.x

Then filter as the task asks, keeping the words with high scores..

# find words with TF-IDF score >0.4 and sort them
df31 = df30[df30["weight"] > 0.1].sort_values("weight", ascending=False)
df31
# it really should be filtering above 0.4, but the highest score I got was only about 0.3... feels like something is wrong somewhere...
# leaving it like this for now _(:з」∠)_
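
One possible explanation for the low scores, which I have not confirmed against the expected solution: with a single document every word gets the same idf, and after L2 normalisation the squared weights add up to 1, so a large vocabulary leaves little weight per word. The sketch below recomputes TF-IDF by hand on made-up token lists, using the formulas I understand to be the `TfidfVectorizer` defaults (smoothed idf, L2 norm). The function `tfidf_l2` and the idea of using one document per variety are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """TF-IDF with smoothed idf and L2 row normalisation.

    Assumed to mirror TfidfVectorizer's defaults:
    idf = ln((1 + n_docs) / (1 + doc_freq)) + 1
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({word: v / norm for word, v in raw.items()})
    return rows

# Toy corpora (hypothetical words): one token list per variety.
shiraz = ["pepper", "pepper", "pepper", "cherry", "oak"]
riesling = ["lime", "lime", "cherry", "oak"]
pinot = ["cherry", "cherry", "oak", "earth"]

single = tfidf_l2([shiraz])[0]                  # Shiraz alone, as in my code
multi = tfidf_l2([shiraz, riesling, pinot])[0]  # Shiraz among other varieties

for word in sorted(single):
    print(word, round(single[word], 3), round(multi[word], 3))
```

In the single-document case the scores are just normalised counts. With other varieties in the corpus, a word that is specific to Shiraz ("pepper" here) gains weight and words shared by every variety lose it, which is closer to what "key words for describing Shiraz" sounds like.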

# save your table to 'key_Shiraz.txt'   
df31.to_csv('key_Shiraz.txt', sep=' ', index=False)
# ended up writing a txt with the csv writer, and it is still readable, nice _(:з」∠)_

That's it overall for now.. there is surely stuff that can be improved...

Handing it in like this should be fine, right... _(:з」∠)_
