These tweet-ids enable the collection of tweets from the Twitter API that are older than nine days (i.e., …

29/07/2022

The website (Footnote 2) was used to collect tweet-ids (Footnote 3); this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). These tweet-ids enable the collection of tweets from the Twitter API that are older than nine days (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and the complementary ‘lookup_status’ function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON files (Footnote 4) were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
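The user-related criteria (1) and (3) can be sketched as follows. This is a minimal Python illustration, not the R pipeline used in the study: the function name `filter_by_user`, the `(user_id, text)` tuple layout, and the choice to rank activity over both periods combined are assumptions. Criterion (2) is omitted because it depends on account metadata that is not described here.

```python
from collections import Counter

def filter_by_user(pre, post, top_share=0.005):
    """Apply two user-level filters to lists of (user_id, text) tuples.

    Hypothetical helper: drops the most active users (top_share of all
    users, ranked by tweet count over both periods) and then keeps only
    users who appear in both the pre and the post dataset.
    """
    counts = Counter(user for user, _ in pre + post)
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = round(len(ranked) * top_share)   # assumption: rounded cut-off
    hyperactive = set(ranked[:n_drop])

    users_pre = {user for user, _ in pre}
    users_post = {user for user, _ in post}
    keep = (users_pre & users_post) - hyperactive

    pre_kept = [row for row in pre if row[0] in keep]
    post_kept = [row for row in post if row[0] in keep]
    return pre_kept, post_kept
```

Restricting both datasets to the same set of users is what makes the comparison a within-group design: any difference between the two periods cannot be attributed to a change in who is tweeting.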

The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
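A minimal Python sketch of this text cleaning is given below. It is an illustration under assumptions, not the study's R code: the regular expressions and the helper names `has_url` and `clean_tweet` are hypothetical, and the sketch does not distinguish media URLs from other URLs, which would require the tweet metadata.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")   # assumption: URLs start with http(s)://
MENTION_RE = re.compile(r"@\w+")       # assumption: screen-name references look like @name

def has_url(text):
    """Return True if the tweet text contains a URL (candidate for exclusion)."""
    return URL_RE.search(text) is not None

def clean_tweet(text):
    """Remove screen-name references and line breaks, then convert to ASCII."""
    text = MENTION_RE.sub(" ", text)
    text = text.replace("\r", " ").replace("\n", " ")
    # Decompose accented characters, then drop everything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, `clean_tweet("@jan Hé,\nmooi café!")` yields `"He, mooi cafe!"`, while `has_url` can be used beforehand to exclude tweets that contain a URL.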

Token and bigram analysis

The R package ‘quanteda’ (Footnote 5) was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation […]). Additionally, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre-CLC and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
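As an illustration of how such scores behave, the Python sketch below computes a Church-style T-score per token. Eq. (1) and (2) are not reproduced in this text, so the exact form used here is an assumption: relative frequency P = f / N, variance approximated as P / N for each dataset, and the post-CLC value minus the pre-CLC value in the numerator so that the sign convention matches the description above. The function name `t_scores` is hypothetical.

```python
import math
from collections import Counter

def t_scores(pre_tokens, post_tokens):
    """Return {token: T-score} for two non-empty token lists.

    Assumed form (not confirmed by the source):
        T = (P_post - P_pre) / sqrt(P_post / N_post + P_pre / N_pre)
    """
    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in set(f_pre) | set(f_post):
        p_pre = f_pre[token] / n_pre      # relative frequency pre-CLC
        p_post = f_post[token] / n_post   # relative frequency post-CLC
        variance = p_post / n_post + p_pre / n_pre
        scores[token] = (p_post - p_pre) / math.sqrt(variance)
    return scores
```

With `pre_tokens = ["a", "a", "a", "b"]` and `post_tokens = ["a", "b", "b", "b"]`, token `a` receives a negative score (relatively more frequent pre-CLC) and token `b` a positive one (relatively more frequent post-CLC).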

Part-of-speech (POS) analysis

The R package ‘openNLP’ (Footnote 6) was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and […]). The openNLP POS model has been reported with an accuracy score of 87.3% when used on English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, the same analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.
