Teaching Portfolio of Instructor 羅德興 - 107-1 Information Technology and Big Data Analysis - First Steps with Datasets

Department of Business Information and Management
Assistant Professor / day-division class advisor
羅德興


First Steps with Datasets

Commands:
> require(datasets)

> data()  # list the available datasets

> head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
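
Beyond head(), a few other base R calls give a quick overview of a dataset. The following is a minimal sketch rather than part of the original exercise; the variable names na_per_column and mean_ozone are chosen here for illustration only.

```r
# Inspect the built-in airquality data frame (ships with the datasets package).
data(airquality)

dim(airquality)   # number of rows and columns
str(airquality)   # column names and types at a glance

# Count the missing values (NA) in each column; head() above already
# shows NA entries in Ozone and Solar.R.
na_per_column <- colSums(is.na(airquality))
print(na_per_column)

# Most summary functions return NA if any value is missing,
# so drop the NAs explicitly with na.rm = TRUE.
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
print(mean_ozone)
```

Without na.rm = TRUE, mean(airquality$Ozone) returns NA, because the column contains missing readings.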

Reference URL:

https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_plots/

The Data

The iris dataset (included with R) contains four measurements for 150 flowers representing three species of iris (Iris setosa, versicolor and virginica). The referenced page has photos of the three species, and some notes on classification based on sepal area versus petal area.

We can inspect the data in R like this:

> iris

The iris variable is a data.frame: it's like a matrix, but the columns may be of different types, and we can access the columns by name:

> class(iris)

View the column names:

> colnames(iris)

List the values of a single column:

> iris$Petal.Length

You can also get the petal lengths by iris[,"Petal.Length"] or iris[,3] (treating the data frame like a matrix/array).
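
As a quick check, the three ways of selecting the column can be compared directly. This is only a sketch; the names by_name, by_label and by_index are introduced here for illustration.

```r
# Three equivalent ways to pull the petal lengths out of iris.
by_name  <- iris$Petal.Length        # list-style access by column name
by_label <- iris[, "Petal.Length"]   # matrix-style access by column name
by_index <- iris[, 3]                # matrix-style access by column position

identical(by_name, by_label)   # do the first two agree?
identical(by_name, by_index)   # does the positional form agree as well?
length(by_name)                # one value per flower

# Note the difference the comma makes:
class(iris[, 3])   # a plain numeric vector
class(iris[3])     # still a data.frame with a single column
```

Selecting by name is usually the safer habit, since a positional index silently points at a different column if the column order ever changes.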

Simple Scatter Plots

Let's do a simple scatter plot of petal length vs. petal width:

> plot(iris$Petal.Length, iris$Petal.Width, main="Edgar Anderson's Iris Data")

It's interesting to mark or colour the points by species. We could use the pch argument (plot character) for this. Consulting the help, we might use pch=21 for filled circles, pch=22 for filled squares, pch=23 for filled diamonds, and pch=24 or pch=25 for up/down triangles. Passing a single value would change all the points at once, so the trick is to create a vector that maps each species to, say, 23, 24 or 25, and use that as the pch argument:

> plot(iris$Petal.Length, iris$Petal.Width, pch=c(23,24,25)[unclass(iris$Species)], main="Edgar Anderson's Iris Data")

This works by using c(23,24,25) to create a vector, and then selecting element 1, 2 or 3 from it for each flower. How? unclass(iris$Species) turns the species column from a vector of categories (a "factor" data type in R terminology) into a vector of ones, twos and threes:

> c(23,24,25)[unclass(iris$Species)]
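
To see what the indexing does step by step, the intermediate values can be printed. This is a minimal sketch; species_codes and symbols are names introduced here for illustration.

```r
# A factor stores each category as an integer code plus a table of labels.
levels(iris$Species)                     # the labels, in code order
species_codes <- unclass(iris$Species)   # the underlying integer codes
table(species_codes)                     # how many flowers carry each code

# Indexing a 3-element vector with those codes picks one symbol per flower:
# code 1 selects 23, code 2 selects 24, code 3 selects 25.
symbols <- c(23, 24, 25)[species_codes]
table(iris$Species, symbols)             # each species maps to exactly one symbol

# as.integer() yields the same codes without the leftover "levels" attribute.
all(as.integer(iris$Species) == species_codes)
```

The mapping depends on the order of levels(iris$Species), so the first symbol goes to the first level, and so on.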

We can use the same trick to generate a vector of colours, and use this on our scatter plot:

> plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data")

With different colours it's even clearer that the three species have very different petal sizes.
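
A coloured plot is easier to read with a legend. The sketch below is an addition to the original commands: it uses a named vector (species_colours, a name chosen here) as an alternative to the unclass() trick, and adds axis labels and a legend with base graphics.

```r
# Map each species name to a colour with a named vector
# (an alternative to indexing by unclass(iris$Species)).
species_colours <- c(setosa = "red", versicolor = "green3", virginica = "blue")
point_colours <- species_colours[as.character(iris$Species)]

plot(iris$Petal.Length, iris$Petal.Width,
     pch = 21, bg = point_colours,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     main = "Edgar Anderson's Iris Data")

# pt.bg sets the fill colour of the legend symbols, matching bg in plot().
legend("topleft", legend = names(species_colours),
       pch = 21, pt.bg = species_colours, title = "Species")
```

Looking colours up by name does not depend on the order of the factor levels, which makes the mapping harder to get wrong.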

Draftsman's or Pairs Scatter Plots

How do the other variables behave? We could generate each plot individually, but there is a quicker way, using the pairs command on the first four columns:

> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

This type of image is also called a Draftsman's display - it shows the possible two-dimensional projections of multidimensional data (in this case, four dimensional). An actual engineer might use this to represent three dimensional physical objects.

It looks like most of the variables could be used to predict the species, except that sepal length and width alone would make it tricky to distinguish Iris versicolor from Iris virginica (green and blue).
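
The visual impression can be checked with a few numbers. This is a rough sketch, not a classifier: it only compares the per-species range of one petal and one sepal measurement, and the names petal_range and sepal_range are introduced here.

```r
# Minimum and maximum of a measurement within each species.
# split() groups the values by species; sapply(range) gives a 2 x 3 matrix.
petal_range <- sapply(split(iris$Petal.Length, iris$Species), range)
sepal_range <- sapply(split(iris$Sepal.Length, iris$Species), range)
rownames(petal_range) <- rownames(sepal_range) <- c("min", "max")

print(petal_range)
print(sepal_range)

# Petal length separates setosa cleanly from versicolor:
petal_range["max", "setosa"] < petal_range["min", "versicolor"]

# Sepal length does not separate versicolor from virginica: the ranges overlap.
sepal_range["max", "versicolor"] > sepal_range["min", "virginica"]
```

Overlapping ranges on a single variable do not rule out separation using several variables together, which is what the pairs plot helps to judge.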

This is starting to get complicated, but we can write our own function to draw something else in the upper panels, such as the absolute value of Pearson's correlation coefficient:

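> panel.pearson <- function(x, y, ...) {
    # par("usr") holds the panel limits as c(x1, x2, y1, y2); take the midpoint of each axis
    horizontal <- (par("usr")[1] + par("usr")[2]) / 2
    vertical <- (par("usr")[3] + par("usr")[4]) / 2
    # print the absolute Pearson correlation in the centre of the panel
    text(horizontal, vertical, format(abs(cor(x, y)), digits = 2))
  }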
> pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, bg = c("red","green3","blue")[unclass(iris$Species)], upper.panel=panel.pearson)
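
The numbers shown in the upper panels can also be computed directly. The sketch below assumes nothing beyond base R; cor_matrix is a name chosen here.

```r
# Pearson correlation between every pair of the four measurements.
# cor() uses method = "pearson" by default.
cor_matrix <- cor(iris[1:4])
print(round(cor_matrix, 2))

# panel.pearson displays abs(cor(x, y)), so the sign is lost in the plot.
# The signed matrix shows, for example, that sepal length and sepal width
# are negatively correlated across the whole dataset.
print(round(abs(cor_matrix), 2))
cor_matrix["Sepal.Length", "Sepal.Width"] < 0
```

Dropping abs() inside panel.pearson would keep the sign in the plot as well, at the cost of slightly wider labels.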

Here is another variation, with different options: it shows only the upper panels and uses alternative captions on the diagonal:

> pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)

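Assignment:
Work through the commands from the reference URL above until you understand what each one does, then change the parameters and run them yourself, adding your student ID and name to the plot title.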
