Talk Show Segmentation System Based On Twitter Using K-Medoids Clustering Algorithm

Innovations in television talk shows can pose a threat: the audience splits into groups, which can lower a program's rating. Program ratings influence the companies that buy advertising services, and because advertising sales are the largest source of income, a television company with falling ratings risks bankruptcy. One way to address this is to analyze public opinion. The results of the analysis provide information about the public's interest in the program. However, the analysis takes a long time and can only be done by a competent person, so a process is needed that produces results quickly and can be carried out by anyone. This study uses K-Medoids Clustering to identify public opinion. The clustering process, a form of unsupervised learning, is combined with a labeling process. Tweet data from the previous episode are labeled and then used to obtain predicted labels for the other cluster members. Before clustering, the tweet data go through text preprocessing and are then transformed into numeric form based on word occurrence. The transformed data are clustered by calculating proximity with Cosine Similarity. The labels of the cluster medoids are applied to the unlabeled tweet data. The cluster results were evaluated with the Silhouette Coefficient and scored 0.19. Nevertheless, the method successfully predicted public opinion and achieved an accuracy of 80%.


I. INTRODUCTION
The volume of data grows every second and can be mined for information. Data processing, or data mining, can be done in various ways, one of which is clustering. Clustering is a method of grouping data based on how similar their characteristics are to one another. Clustering methods fall into two families: hierarchical clustering and partitional clustering. Partitional clustering includes the K-Means, K-Medoids (PAM), and CLARA algorithms. In the field of energy efficiency, a study that searched for and analyzed energy periodicity with the K-Medoids and CLARA algorithms found K-Medoids to be the best algorithm when the distance matrix is computed with Euclidean Distance (Ruiz, Pegalajar, Arcucci, & Molina-Solana, 2020). In an application of K-Means and K-Medoids to 10,000 KEEL transactions, K-Medoids was superior in execution time and was not sensitive to noise (Arora, Deepali, & Varshney, 2016). Many other fields also use clustering to process data.
Clustering can also be applied in broadcasting, where analysis of public opinion has succeeded in producing the necessary information. One study processed data to produce information on people's lifestyles (Li, 2020). Clustering with the K-Means algorithm has provided information for program production decisions, such as determining broadcast schedules so that a program's rating is maintained (Kui et al., 2020). High ratings make advertisers interested in buying advertising services on TV, because they regard television stations with high ratings as more competent (Pribadi, Yoedtadi, & Siswoko, 2017). This study uses unsupervised learning.
This study builds a system for segmenting public opinion on the social media platform Twitter using K-Medoids Clustering. It differs from other segmentation research in that the clustering of public opinion is combined with a labeling process. In addition, tweet data that have been labeled are used at the label prediction stage for the tweet data from episode reruns. Analysis of public opinion has succeeded in producing the necessary information (Devika, Sunitha, & Ganesh, 2016). The "free" nature of social media encourages everyone to express their views in the form of comments (Hutto & Gilbert, 2014) (Ahuja, Chug, Kohli, Gupta, & Ahuja, 2019). A literature review identified 13 studies that applied different clustering methods; its results show that unsupervised learning has several weaknesses when used to mine social media data (Guftar, Ali, Raja, & Qamar, 2015).
One study that aimed to reduce costs in the clustering process introduced a labeling process and succeeded in reducing costs by 50-60% (Shuyang, Heittola, & Virtanen, 2017) (Darnstadt, Meutzner, & Kolossa, 2014). This approach was also applied successfully to the K-Medoids algorithm in an active learning method (Shuyang, Heittola, & Virtanen, 2018) (Ji, Wang, & Ma, 2019). Building on that research, this study combines supervised and unsupervised learning, which is expected to help determine the labels of the tweet data being processed. The tweet data go through the preprocessing stage and are transformed using term frequency calculations. The tweet data in numeric form are then clustered using K-Medoids Clustering with Cosine Similarity. K-Medoids is a partitional clustering algorithm that aims to split the dataset into groups. The K-Medoids algorithm can reduce the extent to which new data from outside a segment are drawn into the cluster center (Tan, 2018). The label of each cluster is predicted from the majority of the labels it contains.

A. Data Mining
Data mining is the process of finding meaningful information through new correlation patterns and trends by sorting through large amounts of data stored in repositories using pattern recognition techniques as well as statistical and mathematical techniques. The stages in data mining are data selection, data cleaning, transformation, data mining and interpretation.

B. Twitter
Twitter is a social media service that lets friends, family, and co-workers communicate and stay connected by exchanging messages. Besides serving as a means of communication, tweets can also be used for analysis (Hutto & Gilbert, 2014).
Analytical techniques are proven tools for extracting information. The information is obtained from the reviews or tweets uploaded by Twitter users. Twitter is a suitable source of material for text analysis (Hutto & Gilbert, 2014) (Ahuja et al., 2019). Each tweet is limited to 140 characters, which makes the information conveyed by the public more meaningful (Dos Santos & Gatti, 2014).

C. Preprocessing
Preprocessing prepares the data for further processing. It follows the same steps as the text mining stage used for information retrieval, text classification, and text grouping (Vijayarani, Ilamathi, & Nithya, 2016). The process has four stages: case folding, tokenizing, filtering, and stemming.
Case folding converts all letters to a single case (upper or lower) and removes characters other than letters. Tokenizing cuts sentences into words. Filtering selects the words to keep and deletes unused words. Stemming, the last stage, groups words that share the same root.
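The four stages above can be sketched as a small pipeline. This is a minimal illustration, not the study's implementation: the stopword set is hypothetical, and because the text does not name a stemmer, the `stem` parameter is a placeholder that defaults to leaving each token unchanged.

```python
import re

# Hypothetical stopword list; the study's actual list is not given.
STOPWORDS = {"yang", "dan", "di", "ini", "itu"}


def case_fold(text):
    """Lowercase the text and replace every run of non-letters with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def tokenize(text):
    """Cut the folded text into words at whitespace."""
    return text.split()


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in stopwords]


def preprocess(text, stem=None):
    """Run case folding, tokenizing, filtering, then stemming.

    `stem` is an assumed hook for a real stemmer; by default tokens pass
    through unchanged.
    """
    stem = stem or (lambda token: token)
    return [stem(t) for t in filter_tokens(tokenize(case_fold(text)))]


tokens = preprocess("Acara INI bagus banget!!! @user123")
print(tokens)
```

In practice a language-specific stemmer would be passed as `stem`, since the identity default leaves inflected forms as separate words.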

D. Data Transformation
The transformation stage is carried out using term frequency. Term Frequency (TF) is the number of times a term appears in the document concerned. The more often a word occurs in a document, the greater its weight and the greater the suitability value it contributes (Chrisnanto & Abdillah, 2015).
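As a small illustration of this transformation, the sketch below turns tokenized documents into term-frequency vectors. It assumes TF is the raw occurrence count and that the vocabulary is the sorted set of all terms in the collection; the example tokens and function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect every distinct term across the tokenized documents."""
    return sorted({term for doc in docs for term in doc})


def tf_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Two hypothetical preprocessed tweets.
docs = [["acara", "bagus", "bagus"], ["acara", "lucu"]]
vocab = build_vocabulary(docs)
vectors = [tf_vector(doc, vocab) for doc in docs]
print(vocab)    # ['acara', 'bagus', 'lucu']
print(vectors)  # [[1, 2, 0], [1, 0, 1]]
```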

E. Clustering
Clustering is a method of grouping data into distinct groups so that the data within each group share the same trends and patterns. K-Medoids is an algorithm that represents each cluster by a central point (medoid) chosen from among the cluster members. The stages are: a. Initialize the initial medoids. To determine the optimal k value, the Elbow technique can be used (Guftar et al., 2015). The distance between objects is calculated with Cosine Distance, as given in (1).
For the cluster quality measure, a is the average distance between the medoid and the objects in the cluster, and b is the average distance between the medoid and the objects outside the cluster. The distances inside and outside the cluster are calculated using Euclidean Distance, as in equation (4):
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \quad i = 1, 2, 3, \ldots, n \tag{4}$$
where d is the distance and x_i and y_i are the values of the i-th variable for the two points being compared (an object and a centroid).
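The two distance measures named here can be sketched in a few lines. The cosine distance below uses the common definition of one minus the cosine similarity, which is an assumption because the exact form of equation (1) is not reproduced in the text. Returning 1.0 for an all-zero vector is likewise an assumed convention.

```python
import math


def cosine_distance(x, y):
    """1 - cos(x, y); assumed convention: distance 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Square root of the summed squared differences, as in equation (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(cosine_distance([1, 0], [0, 1]))     # orthogonal vectors
print(euclidean_distance([0, 0], [3, 4]))  # 3-4-5 triangle
```

Cosine distance ignores vector length, so a short tweet and a long tweet with the same word proportions are treated as close, whereas Euclidean distance would separate them.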

III. RESULTS AND DISCUSSION
The system is built in three stages: input, process, and output. The input consists of Twitter data divided into training data and test data. Test data are needed because the method combines supervised and unsupervised learning, as presented in the background.
The input stage involves two types of processes, which are described under the experimental data. The process and output stages run from preprocessing through to obtaining the labels.
An illustration of the Twitter segmentation system can be seen in Figure 1.

A. Data
The input process uses Twitter data from each talk show episode. The tweets used were posted in 2020. The tweet data are divided into two types: tweets posted during the first broadcast and tweets posted during the rerun. The training data are written in CSV format as shown in Figure 2, and the test data are shown in Figure 3.

B. Labeling Process
At the input stage, tweets are tagged with labels. Clustering is a type of unsupervised learning, but labeling is used at this stage to improve the segmentation results. Three types of labels are used, and they are assigned by the authorized party in the company. The labeled data are the tweets from the first broadcast (not the rerun). The data can be seen in

C. Preprocessing
Preprocessing prepares the data for further processing, following the same steps as the text mining stage used for information retrieval, text classification, and text grouping. The process has four stages: case folding, tokenizing, filtering, and stemming.

a. Case Folding
The first stage is case folding. All data are converted to lowercase. In addition, characters other than the letters 'a' through 'z' are treated as delimiters and removed, including all non-ASCII characters.

b. Tokenizing
The second stage cuts sentences into tokens. Broadly, it breaks the character sequence of a text apart at separator characters such as spaces, punctuation, or other characters that function as pauses.

c. Filtering
The third stage selects words from the results of the previous stage. One way to do this is to remove words that are considered stopwords, such as conjunctions, affixes, and other words with a similar function. Additional stopwords are defined to suit the case, and words that occur too often are deleted.

d. Stemming
The last stage reduces the number of distinct index terms in a document. It groups words that have the same root but different forms. The stemming stage also handles non-standard spellings, typos, and languages other than Indonesian. Words that have the same meaning (synonyms) are still counted as different words.
The results of tweet preprocessing can be seen in Figure 5.

D. Transformation using TF
Data transformation is the process of calculating the weights of the preprocessed tweet text. Determining the clusters requires numeric calculations, so the tweets must first be transformed into numeric data. Here, the data are weighted by term frequency (TF).

E. Cluster using K-Medoids
The fifth stage determines the clusters using the K-Medoids algorithm. The transformed data set is processed with the K-Medoids calculation until the medoids become fixed. The medoids are considered fixed when the current total distance minus the total distance under the new medoids is less than 0, that is, when the new medoids no longer reduce the total distance. The ideal k value is determined with the Elbow method, which indicates the best number of clusters. The results can be seen in Figure 6, which shows the elbow. The ideal k value used when clustering with K-Medoids is 6. The cluster results can be seen in Figure 7, where the x-axis shows the tweet number and the y-axis shows the cluster to which the tweet belongs.
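The swap-until-fixed procedure described above can be sketched as follows. This is a simplified PAM-style loop written for illustration, not the study's code: the initial medoids default to the first k points (an assumption), and a swap is accepted only when it lowers the total distance, so the loop stops exactly when no candidate improves on the current medoids.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; returns 1.0 for an all-zero vector (assumed convention)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 if nx == 0 or ny == 0 else 1.0 - dot / (nx * ny)


def total_distance(points, medoids, dist):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, dist=cosine_distance, init=None):
    """Return (medoid indices, cluster index per point, total distance)."""
    medoids = list(init) if init is not None else list(range(k))
    cost = total_distance(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                trial_cost = total_distance(points, trial, dist)
                # Keep the swap only if it strictly lowers the total distance.
                if cost - trial_cost > 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, cost


# Hypothetical TF vectors: three point mostly along one term, three along another.
points = [[1, 0], [2, 0], [3, 0.1], [0, 1], [0, 2], [0.1, 3]]
medoids, labels, cost = k_medoids(points, k=2)
print(medoids, labels, round(cost, 4))
```

Running `k_medoids` for a range of k values and plotting the returned total distance against k gives the kind of curve on which the Elbow method looks for a bend.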

F. Segment Identification
The sixth stage is segment identification. This stage shows the cluster results together with the tweet labels. Each cluster is characterized by the majority label among its members. Tweets from the rerun episode receive the majority label of the cluster to which they are assigned. The cluster results can be seen in Table 1.
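The majority-label rule can be illustrated with a short sketch. The label names used here are hypothetical, since the text states only that three label types exist, and `None` is an assumed marker for an unlabeled rerun tweet.

```python
from collections import Counter


def cluster_majority_labels(cluster_ids, labels):
    """Map each cluster id to the most frequent label among its labeled members."""
    votes = {}
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            votes.setdefault(cid, Counter())[label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


def predict_labels(cluster_ids, labels):
    """Keep known labels; give unlabeled tweets their cluster's majority label."""
    majority = cluster_majority_labels(cluster_ids, labels)
    return [label if label is not None else majority.get(cid)
            for cid, label in zip(cluster_ids, labels)]


# Hypothetical clustering result: None marks unlabeled rerun tweets.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["positive", "positive", "negative", None, "neutral", None]
predicted = predict_labels(cluster_ids, labels)
print(predicted)
```

A cluster that contains no labeled tweet has no majority, so its unlabeled members stay `None` in this sketch.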

G. Testing
The last stage is testing the system that has been built. Two tests were carried out: a cluster quality test and a label accuracy test. Cluster quality was tested with the Silhouette Coefficient method, and the results can be seen in Figure 8. On average, the 6 clusters formed score 0.19, which means that the clusters have a weak structure. The label accuracy test can be seen in Table 3. The five tests yielded an average accuracy of 80%.
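Both tests can be sketched briefly. The silhouette function below follows the standard per-object definition, s = (b - a) / max(a, b), averaged over all objects; this is an assumption, and it differs slightly from the medoid-based wording of a and b used earlier in the paper. The sample points and labels are hypothetical, and the function expects at least two clusters.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette_coefficient(points, labels, dist=euclidean_distance):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def accuracy(predicted, actual):
    """Fraction of predicted labels that match the reference labels."""
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)


score = silhouette_coefficient([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
print(round(score, 3))
print(accuracy(["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "x"]))
```

Values near 1 indicate compact, well-separated clusters, while values near 0, such as the 0.19 reported here, indicate overlapping clusters.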

IV. CONCLUSION
K-Medoids Clustering, as one of the clustering algorithms, successfully grouped the tweet data. The test data, consisting of 5 tweets from episode reruns, and the training data, consisting of 25 tweets from the first broadcast, produced 6 clusters. The cluster quality test indicates a weak structure, with a value of only 0.19. This happens because the data set is transformed using the term frequency method: words that have the same meaning are treated as different words, so the transformation results are less representative. In contrast, in the label accuracy test the system achieved 80% accuracy across the five tests and was judged successful in showing public opinion on Twitter about the program.