Chi-Squared Test – Zeta and Company

The Chi-squared test and the log-likelihood ratio test are somewhat more sophisticated distribution-based measures, each backed by a statistical hypothesis test. Both are widely used in CL and are implemented in corpus analysis tools such as WordSmith Tools (Scott 1997), Wmatrix (Rayson et al. 2009), and AntConc (Anthony 2005). One problem with these measures is that p-values tend to be very low across the board, because two text corpora are never exactly alike, so the null hypothesis of no difference is rejected for a great many words. The more important problem, however, is that the tests assume statistically independent events and treat each corpus as a bag of words: they use only the total number of words in the corpus and do not take into account that a word may be unevenly distributed across the texts within a corpus (Lijffijt et al. 2014).
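To make the two measures concrete, here is a minimal sketch in Python of how they could be computed for a single word from a 2x2 contingency table (word vs. all other words, corpus A vs. corpus B). The function name `keyness_2x2` and all counts are hypothetical and chosen only for illustration. The sketch applies no continuity correction and is not meant to reproduce what WordSmith Tools, Wmatrix, or AntConc do internally.

```python
import math


def keyness_2x2(freq_a, size_a, freq_b, size_b):
    """Chi-squared and log-likelihood ratio (G2) for one word in two corpora.

    freq_a / freq_b: occurrences of the word in corpus A / corpus B.
    size_a / size_b: total number of tokens in corpus A / corpus B.
    Returns (chi2, p_chi2, g2, p_g2), with p-values for 1 degree of freedom.
    """
    if freq_a + freq_b == 0:
        raise ValueError("the word must occur at least once in one corpus")

    # Observed 2x2 table: rows are corpora, columns are (word, other words).
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_sums = [size_a, size_b]
    col_sums = [freq_a + freq_b, total - freq_a - freq_b]

    chi2 = 0.0
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the word were equally likely in both corpora.
            expected = row_sums[i] * col_sums[j] / total
            obs = observed[i][j]
            chi2 += (obs - expected) ** 2 / expected
            if obs > 0:  # a cell with 0 observations contributes 0 to G2
                g2 += 2 * obs * math.log(obs / expected)
    g2 = max(g2, 0.0)  # guard against tiny negative rounding errors

    # Survival function of the chi-squared distribution with 1 degree of
    # freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_chi2 = math.erfc(math.sqrt(chi2 / 2))
    p_g2 = math.erfc(math.sqrt(g2 / 2))
    return chi2, p_chi2, g2, p_g2


# Same relative frequencies (100 vs. 110 per 100,000 tokens), two corpus sizes.
small = keyness_2x2(100, 100_000, 110, 100_000)
large = keyness_2x2(10_000, 10_000_000, 11_000, 10_000_000)
```

With these made-up counts, the small corpora give a p-value well above 0.05, while the corpora that are a hundred times larger give a p-value far below 0.001 for exactly the same relative frequencies. Note also that the function receives only totals per corpus: a word concentrated in a single text and a word spread evenly over all texts yield identical results, which is the bag-of-words limitation described above.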
References



McGillivray, Barbara, and Gábor Mihály Tóth, ‘Frequency’, in Applying Language Technology in Humanities Research: Design, Application, and the Underlying Logic, ed. by Barbara McGillivray and Gábor Mihály Tóth (Springer International Publishing, 2020), pp. 35–46, doi:10.1007/978-3-030-46493-6_3
Froehlich, Heather, ‘Corpus Analysis with Antconc’, Programming Historian, 2015 <https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc> [accessed 15 February 2021]
Savoy, Jacques, ‘Comparative Evaluation of Term Selection Functions for Authorship Attribution’, Literary and Linguistic Computing, 30.2 (2015), pp. 246–61, http://doi.org/10.1093/llc/fqt047
Gries, Stefan Th., ‘The Most Under-Used Statistical Method in Corpus Linguistics: Multi-Level (and Mixed-Effects) Models’, Corpora, 10.1 (2015), pp. 95–125, http://doi.org/10.3366/cor.2015.0068
Bestgen, Yves, ‘Inadequacy of the Chi-Squared Test to Examine Vocabulary Differences between Corpora’, Literary and Linguistic Computing, 29.2 (2014), pp. 164–70, http://doi.org/10.1093/llc/fqt020
Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), pp. 374–97, http://doi.org/10.1093/llc/fqu064
Parsons, Kathryn, Agata McCormac, and Marcus Butavicius, Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff’s (2001) Approach (DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION EDINBURGH (AUSTRALIA) COMMAND CONTROL COMMUNICATIONS AND INTELLIGENCE DIV, April 2009) <https://apps.dtic.mil/docs/citations/ADA506585> [accessed 17 September 2019]
Lüdeling, Anke, and Merja Kytö, eds., ‘Statistical Methods for Corpus Exploitation’, in Handbooks of Linguistics and Communication Science (Mouton de Gruyter, 2009), doi:10.1515/9783110213881.2.777
Oakes, Michael P., and Malcolm Farrow, ‘Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries’, Literary and Linguistic Computing, 22.1 (2007), pp. 85–99, http://doi.org/10.1093/llc/fql044
Anthony, Laurence, ‘AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom’, 2005, pp. 729–37, http://doi.org/10.1109/IPCC.2005.1494244
Lancaster and Seneta, …
arsedDate%22%3A%222005-07-15%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3ELancaster%2C%20H.%20O.%2C%20and%20E.%20Seneta%2C%20%26%23x2018%3BChi-Square%20Distribution%26%23x2019%3B%2C%20in%20%3Ci%3EEncyclopedia%20of%20Biostatistics%3C%5C%2Fi%3E%2C%20ed.%20by%20Peter%20Armitage%20and%20Theodore%20Colton%20%28John%20Wiley%20%26amp%3B%20Sons%2C%20Ltd%2C%202005%29%2C%20p.%20b2a15018%2C%20doi%3A10.1002%5C%2F0470011815.b2a15018%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22bookSection%22%2C%22title%22%3A%22Chi-Square%20Distribution%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22editor%22%2C%22firstName%22%3A%22Peter%22%2C%22lastName%22%3A%22Armitage%22%7D%2C%7B%22creatorType%22%3A%22editor%22%2C%22firstName%22%3A%22Theodore%22%2C%22lastName%22%3A%22Colton%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22H.%20O.%22%2C%22lastName%22%3A%22Lancaster%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22E.%22%2C%22lastName%22%3A%22Seneta%22%7D%5D%2C%22abstractNote%22%3A%22A%20chi%5Cu2010square%20random%20variable%20is%20defined%20as%20the%20sum%20of%20squares%20of%20independently%20distributed%20standard%20normal%20random%20variables%2C%20which%20explains%20the%20additive%20property%20of%20independent%20chi%5Cu2010square%20random%20variables.%20Its%20probability%20distribution%20is%20described%20by%20a%20gamma%20probability%20density.%20The%20chi%5Cu2010square%20goodness%5Cu2010of%5Cu2010fit%20statistic%2C%20when%20sample%20size%20is%20large%2C%20is%20approximately%20a%20chi%5Cu2010square%20random%20variable.%20Tests%20of%20hypotheses%20relating%20to%20contingency%20tables%20are%20also%20based%20on%20a%20statistic%20with%20approximate%20chi%5Cu2010square%20distribution.%22%2C%22bookTitle%22%3A%22Encyclopedia%20of%20Biostatis
tics%22%2C%22date%22%3A%222005-07-15%22%2C%22language%22%3A%22en%22%2C%22ISBN%22%3A%22978-0-470-84907-1%20978-0-470-01181-2%22%2C%22url%22%3A%22http%3A%5C%2F%5C%2Fdoi.wiley.com%5C%2F10.1002%5C%2F0470011815.b2a15018%22%2C%22collections%22%3A%5B%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A30Z%22%7D%7D%2C%7B%22key%22%3A%2249T2NLQJ%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A5206995%2C%22username%22%3A%22roettgermann%22%2C%22name%22%3A%22%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Froettgermann%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Rayson%22%2C%22parsedDate%22%3A%222005%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3ERayson%2C%20Paul%2C%20%26%23x2018%3BWmatrix%3A%20A%20Web-Based%20Corpus%20Processing%20Environment.%26%23x2019%3B%20%28Computing%20Department%2C%20Lancaster%20University%2C%202005%29%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22bookSection%22%2C%22title%22%3A%22Wmatrix%3A%20a%20web-based%20corpus%20processing%20environment.%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Paul%22%2C%22lastName%22%3A%22Rayson%22%7D%5D%2C%22abstractNote%22%3A%22%22%2C%22bookTitle%22%3A%22%22%2C%22date%22%3A%222005%22%2C%22language%22%3A%22English%22%2C%22ISBN%22%3A%22%22%2C%22url%22%3A%22%22%2C%22collections%22%3A%5B%5D%2C%22dateModified%22%3A%222024-03-22T10%3A25%3A37Z%22%7D%7D%2C%7B%22key%22%3A%22CLLZITMM%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A228821%2C%22username%22%3A%22christof.s%22%2C%22name%22%3A%22Christof%20Sch%5Cu00f6ch%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22
https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Fchristof.s%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Gabriela%20Cavagli%5Cu00e0%22%2C%22parsedDate%22%3A%222002%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EGabriela%20Cavagli%26%23xE0%3B%2C%20%26%23x2018%3BMeasuring%20Corpus%20Homogeneity%20Using%20a%20Range%20of%20Measures%20for%20Inter-Document%20Distance%20Measuring%20Corpus%20Homogeneity%20Using%20a%20Range%20of%20Measures%20for%20Inter-Document%20Distance%20%7C%20Request%20PDF%26%23x2019%3B%2C%20%3Ci%3EResearchGate%3C%5C%2Fi%3E%2C%202002%20%26lt%3B%3Ca%20class%3D%27zp-ItemURL%27%20href%3D%27https%3A%5C%2F%5C%2Fwww.researchgate.net%5C%2Fpublication%5C%2F267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance%27%3Ehttps%3A%5C%2F%5C%2Fwww.researchgate.net%5C%2Fpublication%5C%2F267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance%3C%5C%2Fa%3E%26gt%3B%20%5Baccessed%2017%20September%202019%5D%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22webpage%22%2C%22title%22%3A%22Measuring%20corpus%20homogeneity%20using%20a%20range%20of%20measures%20for%20inter-document%20distance%20Measuring%20corpus%20homogeneity%20using%20a%20range%20of%20measures%20for%20inter-document%20distance%20%7C%20Request%20PDF%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22%22%2C%22lastName%22%3A%22Gabriela%20Cavagli%5Cu00e0%22%7D%5D%2C%22abstractNote%22%3A%22With%20the%20ever%20more%20widespread%20use%20of%20corpora%20in%20language%20research%2C%20it%20is%20bec
oming%20increasingly%20important%20to%20be%20able%20to%20describe%20and%20compare%20corpora.%20The%20analysis%20of%20corpus%20homogeneity%20is%20preliminary%20to%20any%20quantitative%20approach%20to%20corpora%20comparison.%20We%20describe%20a%20method%20for%20text%20analysis%20based%20only%20on%20document-internal%20linguistic%20features%2C%20and%20a%20set%20of%20related%20homogeneity%20measures%20based%20on%20inter-document%20distance.%20We%20present%20a%20preliminary%20experiment%20to%20validate%20the%20hypothesis%20that%20in%20the%20presence%20of%20a%20homogeneous%20corpus%20the%20subcorpus%20that%20is%20necessary%20to%20train%20an%20NLP%20system%20is%20smaller%20than%20the%20one%20required%20if%20a%20heterogeneous%20corpus%20is%20used.%22%2C%22date%22%3A%222002%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fwww.researchgate.net%5C%2Fpublication%5C%2F267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance%22%2C%22language%22%3A%22en%22%2C%22collections%22%3A%5B%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A24Z%22%7D%7D%2C%7B%22key%22%3A%22B72UB5BP%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22creatorSummary%22%3A%22Kilgarriff%22%2C%22parsedDate%22%3A%222001%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EKilgarriff%2C%20Adam%2C%20%26%23x2018%3BComparing%20Corpora%26%23x2019%3B%2C%20%3Ci%3EInternational%20Journal%20of%20Corpus%20Linguistics%3C%5C%2Fi%3E%2C%206.1%20%282001%29%2C%20pp.%2097%26%23x2013%3B133%2C%20%3Ca%20class%3D%27zp-DOIURL%27%20href%3D%27http%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1075%5C%2Fijcl.6.1.05kil%27%3Ehttp%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1075%5C%2Fijcl.6.1.05kil%3C%5C%2Fa%3E%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fd
iv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22journalArticle%22%2C%22title%22%3A%22Comparing%20Corpora%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Adam%22%2C%22lastName%22%3A%22Kilgarriff%22%7D%5D%2C%22abstractNote%22%3A%22Corpus%20linguistics%20lacks%20strategies%20for%20describing%20and%20comparing%20corpora.%20Currently%20most%20descriptions%20of%20corpora%20are%20textual%2C%20and%20questions%20such%20as%20%5Cu2018what%20sort%20of%20a%20corpus%20is%20this%3F%5Cu2019%2C%20or%20%5Cu2018how%20does%20this%20corpus%20compare%20to%20that%3F%5Cu2019%20can%20only%20be%20answered%20impressionistically.%20This%20paper%20considers%20various%20ways%20in%20which%20different%20corpora%20can%20be%20compared%20more%20objectively.%20First%20we%20address%20the%20issue%2C%20%5Cu2018which%20words%20are%20particularly%20characteristic%20of%20a%20corpus%3F%5Cu2019%2C%20reviewing%20and%20critiquing%20the%20statistical%20methods%20which%20have%20been%20applied%20to%20the%20question%20and%20proposing%20the%20use%20of%20the%20Mann-Whitney%20ranks%20test.%20Results%20of%20two%20corpus%20comparisons%20using%20the%20ranks%20test%20are%20presented.%20Then%2C%20we%20consider%20measures%20for%20corpus%20similarity.%20After%20discussing%20limitations%20of%20the%20idea%20of%20corpus%20similarity%2C%20we%20present%20a%20method%20for%20evaluating%20corpus%20similarity%20measures.%20We%20consider%20several%20measures%20and%20establish%20that%20a%20%5C%5Cchi%5C%5Ctsup%7B2%7D-based%20one%20performs%20best.%20All%20methods%20considered%20in%20this%20paper%20are%20based%20on%20word%20and%20ngram%20frequencies%3B%20the%20strategy%20is%20defended.%22%2C%22date%22%3A%222001%22%2C%22language%22%3A%22en%22%2C%22DOI%22%3A%2210.1075%5C%2Fijcl.6.1.05kil%22%2C%22ISSN%22%3A%221384-6655%2C%201569-9811%22%2C%22url%22%3A%22http%3A%5C%2F%5C%2Fwww.jbe-platform.com%5C%2Fcontent%5C%2Fjournals%5C%2F10.1075%5C%2Fijcl.6.1.05kil%22%2C%22collections%22%3A%5B%22IUKRIB7T%22%2C%22NG7P7RZ
R%22%5D%2C%22dateModified%22%3A%222024-02-20T09%3A03%3A30Z%22%7D%7D%2C%7B%22key%22%3A%22775JLQX4%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A5206995%2C%22username%22%3A%22roettgermann%22%2C%22name%22%3A%22%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Froettgermann%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Scott%22%2C%22parsedDate%22%3A%221997%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EScott%2C%20Mike%2C%20%26%23x2018%3BPC%20Analysis%20of%20Key%20Words%20and%20Key%20Key%20Words%26%23x2019%3B%2C%20%3Ci%3ESystem%3C%5C%2Fi%3E%2C%2025.2%20%281997%29%2C%20pp.%20233%26%23x2013%3B45%2C%20%3Ca%20class%3D%27zp-DOIURL%27%20href%3D%27http%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1016%5C%2FS0346-251X%2897%2900011-0%27%3Ehttp%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1016%5C%2FS0346-251X%2897%2900011-0%3C%5C%2Fa%3E%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22journalArticle%22%2C%22title%22%3A%22PC%20Analysis%20of%20key%20words%20and%20key%20key%20words%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Mike%22%2C%22lastName%22%3A%22Scott%22%7D%5D%2C%22abstractNote%22%3A%22PC%20analysis%20of%20key%20words%20%5Cu2014%20And%20key%20key%20words%22%2C%22date%22%3A%226%5C%2F1997%22%2C%22language%22%3A%22eng%22%2C%22DOI%22%3A%2210.1016%5C%2FS0346-251X%2897%2900011-0%22%2C%22ISSN%22%3A%220346251X%22%2C%22url%22%3A%22%22%2C%22collections%22%3A%5B%22IUKRIB7T%22%5D%2C%22dateModified%22%3A%222024-02-20T09%3A07%3A15Z%22%7D%7D%2C%7B%22key%22%3A%22HLA7H4H6%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A228821%2C%22username%22%3A%22christof.s%22%2C%22name%22%3
A%22Christof%20Sch%5Cu00f6ch%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Fchristof.s%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Kilgarriff%22%2C%22parsedDate%22%3A%221997%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EKilgarriff%2C%20Adam%2C%20%26%23x2018%3BUsing%20Word%20Frequency%20Lists%20to%20Measure%20Corpus%20Homogeneity%20and%20Similarity%20between%20Corpora%26%23x2019%3B%2C%20in%20%3Ci%3EFifth%20Workshop%20on%20Very%20Large%20Corpora%3C%5C%2Fi%3E%2C%201997%20%26lt%3B%3Ca%20class%3D%27zp-ItemURL%27%20href%3D%27https%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%27%3Ehttps%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%3C%5C%2Fa%3E%26gt%3B%20%5Baccessed%206%20September%202019%5D%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22conferencePaper%22%2C%22title%22%3A%22Using%20Word%20Frequency%20Lists%20to%20Measure%20Corpus%20Homogeneity%20and%20Similarity%20between%20Corpora%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Adam%22%2C%22lastName%22%3A%22Kilgarriff%22%7D%5D%2C%22abstractNote%22%3A%22How%20similar%20are%20two%20corpora%3F%20A%20measure%20of%20corpus%20similarity%20would%20be%20very%20useful%20for%5Cnlexicography%20and%20language%20engineering.%20Word%20frequency%20lists%20are%20cheap%20and%20easy%20to%20generate%5Cnso%20a%20measure%20based%20on%20them%20would%20be%20of%20use%20as%20a%20quick%20guide%20in%20many%20circumstances%3B%20for%5Cnexample%2C%20to%20judge%20how%20a%20newly%20available%20corpus%20related%20to%20existing%20resources%2C%20or%20how%20easy%20it%5Cnmight%20be%20to%20port%20an%20NLP%20system%20designed%20to%20work%20with%20one%20text%20type%20to%20work%20with%20another.%5CnWe
%20show%20that%20corpus%20similarity%20can%20only%20be%20interpreted%20in%20the%20light%20of%20corpus%20homogeneity.%5CnThe%20paper%20presents%20a%20measure%2C%20based%20on%20the%20XX%202%20statistic%2C%20for%20measuring%20both%20corpus%20similarity%5Cnand%20corpus%20homogeneity.%20The%20measure%20is%20compared%20with%20a%20rank-based%20measure%20and%20shown%5Cnto%20outperform%20it.%20Some%20results%20are%20presented.%20A%20method%20for%20evaluating%20the%20accuracy%20of%20the%5Cnmeasure%20is%20introduced%20and%20some%20results%20of%20using%20the%20measure%20are%20presented.%22%2C%22date%22%3A%221997%22%2C%22proceedingsTitle%22%3A%22Fifth%20Workshop%20on%20Very%20Large%20Corpora%22%2C%22conferenceName%22%3A%22%22%2C%22language%22%3A%22en%22%2C%22DOI%22%3A%22%22%2C%22ISBN%22%3A%22%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%22%2C%22collections%22%3A%5B%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A30Z%22%7D%7D%2C%7B%22key%22%3A%22CIWQMVLM%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22creatorSummary%22%3A%22Kilgarriff%22%2C%22parsedDate%22%3A%221997%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EKilgarriff%2C%20Adam%2C%20%26%23x2018%3BUsing%20Word%20Frequency%20Lists%20to%20Measure%20Corpus%20Homogeneity%20and%20Similarity%20between%20Corpora%26%23x2019%3B%2C%20in%20%3Ci%3EFifth%20Workshop%20on%20Very%20Large%20Corpora%3C%5C%2Fi%3E%2C%201997%20%26lt%3B%3Ca%20class%3D%27zp-ItemURL%27%20href%3D%27https%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%27%3Ehttps%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%3C%5C%2Fa%3E%26gt%3B%20%5Baccessed%2012%20June%202020%5D%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22conferencePaper%22%2C%22title%22%3A%22Using%20W
ord%20Frequency%20Lists%20to%20Measure%20Corpus%20Homogeneity%20and%20Similarity%20between%20Corpora%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Adam%22%2C%22lastName%22%3A%22Kilgarriff%22%7D%5D%2C%22abstractNote%22%3A%22%22%2C%22date%22%3A%221997%22%2C%22proceedingsTitle%22%3A%22Fifth%20Workshop%20on%20Very%20Large%20Corpora%22%2C%22conferenceName%22%3A%22%22%2C%22language%22%3A%22%22%2C%22DOI%22%3A%22%22%2C%22ISBN%22%3A%22%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fwww.aclweb.org%5C%2Fanthology%5C%2FW97-0122%22%2C%22collections%22%3A%5B%5D%2C%22dateModified%22%3A%222020-06-12T09%3A55%3A06Z%22%7D%7D%2C%7B%22key%22%3A%22RKF7VLQ9%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A228821%2C%22username%22%3A%22christof.s%22%2C%22name%22%3A%22Christof%20Sch%5Cu00f6ch%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Fchristof.s%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Cressie%20and%20Read%22%2C%22parsedDate%22%3A%221989%22%2C%22numChildren%22%3A1%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3ECressie%2C%20Noel%20A.%20C.%2C%20and%20Timothy%20R.%20C.%20Read%2C%20%26%23x2018%3BPearsons-X2%20and%20the%20Loglikelihood%20Ratio%20Statistic-G2%3A%20A%20Comparative%20Review%26%23x2019%3B%2C%201989%2C%20%3Ca%20class%3D%27zp-DOIURL%27%20href%3D%27http%3A%5C%2F%5C%2Fdoi.org%5C%2F10.2307%5C%2F1403582%27%3Ehttp%3A%5C%2F%5C%2Fdoi.org%5C%2F10.2307%5C%2F1403582%3C%5C%2Fa%3E%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22conferencePaper%22%2C%22title%22%3A%22Pearsons-X2%20and%20the%20loglikelihood%20ratio%20statistic-G2%3A%20a%20comparative%20review%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22
%3A%22Noel%20A.%20C.%22%2C%22lastName%22%3A%22Cressie%22%7D%2C%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Timothy%20R.%20C.%22%2C%22lastName%22%3A%22Read%22%7D%5D%2C%22abstractNote%22%3A%22Summary%20The%20importance%20of%20developing%20useful%20and%20appropriate%20statistical%20methods%20for%20analyzing%20discrete%20multivariate%20data%20is%20apparent%20from%20the%20enormous%20amount%20of%20attention%20this%20subject%20has%20commanded%20in%20the%20literature%20over%20the%20last%20thirty%20years.%20Central%20to%20these%20discussions%20has%20been%20Pearson%27s%20X2%20statistic%20and%20the%20loglikelihood%20ratio%20statistic%20G2.%20Our%20review%20seeks%20to%20consolidate%20this%20fragmented%20literature%20and%20develop%20a%20unifying%20theme%20for%20much%20of%20this%20research.%20The%20traditional%20X2%20and%20G2%20statistics%20are%20viewed%20as%20members%20of%20the%20power-divergence%20family%20of%20statistics%2C%20and%20are%20linked%20through%20a%20single%20real-valued%20parameter.%20The%20principal%20areas%20covered%20in%20this%20comparative%20survey%20are%20small-sample%20comparisons%20of%20X2%20and%20G2%20under%20both%20classical%20%28fixed-cells%29%20assumptions%20and%20sparseness%20assumptions%2C%20efficiency%20comparisons%2C%20and%20various%20modifications%20to%20the%20test%20statistics%20%28including%20parameter%20estimation%20for%20ungrouped%20data%2C%20data-dependent%20and%20overlapping%20cell%20boundaries%2C%20serially%20dependent%20data%2C%20and%20smoothing%29.%20Finally%20some%20future%20areas%20for%20research%20are%20discussed.%22%2C%22date%22%3A%221989%22%2C%22proceedingsTitle%22%3A%22%22%2C%22conferenceName%22%3A%22%22%2C%22language%22%3A%22%22%2C%22DOI%22%3A%2210.2307%5C%2F1403582%22%2C%22ISBN%22%3A%22%22%2C%22url%22%3A%22%22%2C%22collections%22%3A%5B%222CZHD96W%22%2C%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A30Z%22%7D%7D%2C%7B%22key%22%3A%22STJT6CPR%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B
%22lastModifiedByUser%22%3A%7B%22id%22%3A228821%2C%22username%22%3A%22christof.s%22%2C%22name%22%3A%22Christof%20Sch%5Cu00f6ch%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Fchristof.s%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Plackett%22%2C%22parsedDate%22%3A%221983%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EPlackett%2C%20R.%20L.%2C%20%26%23x2018%3BKarl%20Pearson%20and%20the%20Chi-Squared%20Test%26%23x2019%3B%2C%20%3Ci%3EInternational%20Statistical%20Review%20%5C%2F%20Revue%20Internationale%20de%20Statistique%3C%5C%2Fi%3E%2C%2051.1%20%281983%29%2C%20p.%2059%2C%20%3Ca%20class%3D%27zp-DOIURL%27%20href%3D%27http%3A%5C%2F%5C%2Fdoi.org%5C%2F10.2307%5C%2F1402731%27%3Ehttp%3A%5C%2F%5C%2Fdoi.org%5C%2F10.2307%5C%2F1402731%3C%5C%2Fa%3E%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22journalArticle%22%2C%22title%22%3A%22Karl%20Pearson%20and%20the%20Chi-Squared%20Test%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22R.%20L.%22%2C%22lastName%22%3A%22Plackett%22%7D%5D%2C%22abstractNote%22%3A%22%22%2C%22date%22%3A%2204%5C%2F1983%22%2C%22language%22%3A%22%22%2C%22DOI%22%3A%2210.2307%5C%2F1402731%22%2C%22ISSN%22%3A%2203067734%22%2C%22url%22%3A%22https%3A%5C%2F%5C%2Fwww.jstor.org%5C%2Fstable%5C%2F1402731%3Forigin%3Dcrossref%22%2C%22collections%22%3A%5B%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A30Z%22%7D%7D%2C%7B%22key%22%3A%22Q82W2B8N%22%2C%22library%22%3A%7B%22id%22%3A2241481%7D%2C%22meta%22%3A%7B%22lastModifiedByUser%22%3A%7B%22id%22%3A228821%2C%22username%22%3A%22christof.s%22%2C%22name%22%3A%22Christof%20Sch%5Cu00f6ch%22%2C%22links%22%3A%7B%22alternate%22%3A%7B%22href%22%3A%22https%3A%5C%2F%5C%2Fwww.zotero.org%5C%2Fchr
istof.s%22%2C%22type%22%3A%22text%5C%2Fhtml%22%7D%7D%7D%2C%22creatorSummary%22%3A%22Brinegar%22%2C%22parsedDate%22%3A%221963%22%2C%22numChildren%22%3A0%7D%2C%22bib%22%3A%22%3Cdiv%20class%3D%5C%22csl-bib-body%5C%22%20style%3D%5C%22line-height%3A%201.35%3B%20padding-left%3A%201em%3B%20text-indent%3A-1em%3B%5C%22%3E%5Cn%20%20%3Cdiv%20class%3D%5C%22csl-entry%5C%22%3EBrinegar%2C%20Claude%20S.%2C%20%26%23x2018%3BMark%20Twain%20and%20the%20Quintus%20Curtius%20Snodgrass%20Letters%3A%20A%20Statistical%20Test%20of%20Authorship%26%23x2019%3B%2C%20%3Ci%3EJournal%20of%20the%20American%20Statistical%20Association%3C%5C%2Fi%3E%2C%2058.301%20%281963%29%2C%20pp.%2085%26%23x2013%3B96%2C%20%3Ca%20class%3D%27zp-DOIURL%27%20href%3D%27http%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1080%5C%2F01621459.1963.10500834%27%3Ehttp%3A%5C%2F%5C%2Fdoi.org%5C%2F10.1080%5C%2F01621459.1963.10500834%3C%5C%2Fa%3E%3C%5C%2Fdiv%3E%5Cn%3C%5C%2Fdiv%3E%22%2C%22data%22%3A%7B%22itemType%22%3A%22journalArticle%22%2C%22title%22%3A%22Mark%20Twain%20and%20the%20Quintus%20Curtius%20Snodgrass%20Letters%3A%20A%20Statistical%20Test%20of%20Authorship%22%2C%22creators%22%3A%5B%7B%22creatorType%22%3A%22author%22%2C%22firstName%22%3A%22Claude%20S.%22%2C%22lastName%22%3A%22Brinegar%22%7D%5D%2C%22abstractNote%22%3A%22Mark%20Twain%20is%20widely%20credited%20with%20the%20authorship%20of%2010%20letters%20published%20in%201861%20in%20the%20New%20Orleans%20Daily%20Crescent.%20The%20adventures%20described%20in%20these%20letters%2C%20which%20were%20signed%20%5Cu201cQuintus%20Curtius%20Snodgrass%2C%5Cu201d%20provide%20the%20historical%20basis%20of%20a%20main%20part%20of%20Twain%27s%20presumed%20role%20in%20the%20Civil%20War.%20This%20study%20applies%20an%20old%2C%20though%20little%20used%20statistical%20test%20of%20authorship%5Cu2014a%20word-length%20frequency%20test%5Cu2014to%20show%20that%20Twain%20almost%20certainly%20did%20not%20write%20these%2010%20letters.%20The%20statistical%20analysis%20includes%20a%20visual%20comparison%20of%20several%
20word-length%20frequency%20distributions%20and%20applications%20of%20the%20%5Cu03c72%20and%20two-sample%20t%20tests.%22%2C%22date%22%3A%2203%5C%2F1963%22%2C%22language%22%3A%22en%22%2C%22DOI%22%3A%2210.1080%5C%2F01621459.1963.10500834%22%2C%22ISSN%22%3A%220162-1459%2C%201537-274X%22%2C%22url%22%3A%22http%3A%5C%2F%5C%2Fwww.tandfonline.com%5C%2Fdoi%5C%2Fabs%5C%2F10.1080%5C%2F01621459.1963.10500834%22%2C%22collections%22%3A%5B%224MZ8ZP2B%22%2C%22NG7P7RZR%22%5D%2C%22dateModified%22%3A%222020-10-13T05%3A30%3A30Z%22%7D%7D%5D%7D

				

  McGillivray, Barbara, and Gábor Mihály Tóth, ‘Frequency’, in Applying Language Technology in Humanities Research: Design, Application, and the Underlying Logic, ed. by Barbara McGillivray and Gábor Mihály Tóth (Springer International Publishing, 2020), pp. 35–46, doi:10.1007/978-3-030-46493-6_3

				
				

  Froehlich, Heather, ‘Corpus Analysis with Antconc’, Programming Historian, 2015 <https://programminghistorian.org/en/lessons/corpus-analysis-with-antconc> [accessed 15 February 2021]

				
				

  Savoy, Jacques, ‘Comparative Evaluation of Term Selection Functions for Authorship Attribution’, Literary and Linguistic Computing, 30.2 (2015), pp. 246–61, http://doi.org/10.1093/llc/fqt047

				
				

  Gries, Stefan Th., ‘The Most Under-Used Statistical Method in Corpus Linguistics: Multi-Level (and Mixed-Effects) Models’, Corpora, 10.1 (2015), pp. 95–125, http://doi.org/10.3366/cor.2015.0068

				
				

  Bestgen, Yves, ‘Inadequacy of the Chi-Squared Test to Examine Vocabulary Differences between Corpora’, Literary and Linguistic Computing, 29.2 (2014), pp. 164–70, http://doi.org/10.1093/llc/fqt020

				
				

  Lijffijt, Jefrey, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila, ‘Significance Testing of Word Frequencies in Corpora’, Digital Scholarship in the Humanities, 31.2 (2014), pp. 374–97, http://doi.org/10.1093/llc/fqu064

				
				

  Parsons, Kathryn, Agata McCormac, and Marcus Butavicius, Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff’s (2001) Approach (Defence Science and Technology Organisation, Edinburgh (Australia), Command, Control, Communications and Intelligence Division, April 2009) <https://apps.dtic.mil/docs/citations/ADA506585> [accessed 17 September 2019]

				
				

  Lüdeling, Anke, and Merja Kytö, eds., ‘Statistical Methods for Corpus Exploitation’, in Handbooks of Linguistics and Communication Science (Mouton de Gruyter, 2009), doi:10.1515/9783110213881.2.777

				
				

  Oakes, Michael P., and Malcolm Farrow, ‘Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries’, Literary and Linguistic Computing, 22.1 (2007), pp. 85–99, http://doi.org/10.1093/llc/fql044

				
				

  Anthony, Laurence, ‘AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom’, 2005, pp. 729–37, http://doi.org/10.1109/IPCC.2005.1494244

				
				

  Lancaster, H. O., and E. Seneta, ‘Chi-Square Distribution’, in Encyclopedia of Biostatistics, ed. by Peter Armitage and Theodore Colton (John Wiley & Sons, Ltd, 2005), p. b2a15018, doi:10.1002/0470011815.b2a15018

				
				

  Rayson, Paul, ‘Wmatrix: A Web-Based Corpus Processing Environment.’ (Computing Department, Lancaster University, 2005)

				
				

  Cavaglià, Gabriela, ‘Measuring Corpus Homogeneity Using a Range of Measures for Inter-Document Distance’, ResearchGate, 2002 <https://www.researchgate.net/publication/267784878_ITRI-02-08_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance_Measuring_corpus_homogeneity_using_a_range_of_measures_for_inter-document_distance> [accessed 17 September 2019]

				
				

  Kilgarriff, Adam, ‘Comparing Corpora’, International Journal of Corpus Linguistics, 6.1 (2001), pp. 97–133, http://doi.org/10.1075/ijcl.6.1.05kil

				
				

  Scott, Mike, ‘PC Analysis of Key Words and Key Key Words’, System, 25.2 (1997), pp. 233–45, http://doi.org/10.1016/S0346-251X(97)00011-0

				
				

  Kilgarriff, Adam, ‘Using Word Frequency Lists to Measure Corpus Homogeneity and Similarity between Corpora’, in Fifth Workshop on Very Large Corpora, 1997 <https://www.aclweb.org/anthology/W97-0122> [accessed 6 September 2019]

				
				


				
				

  Cressie, Noel A. C., and Timothy R. C. Read, ‘Pearson’s X2 and the Loglikelihood Ratio Statistic G2: A Comparative Review’, 1989, http://doi.org/10.2307/1403582

				
				

  Plackett, R. L., ‘Karl Pearson and the Chi-Squared Test’, International Statistical Review / Revue Internationale de Statistique, 51.1 (1983), p. 59, http://doi.org/10.2307/1402731

				
				

  Brinegar, Claude S., ‘Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship’, Journal of the American Statistical Association, 58.301 (1963), pp. 85–96, http://doi.org/10.1080/01621459.1963.10500834