Thursday, December 23, 2010

Corpus-based computational linguistics: A practical investigation of the procedures involved in the selection, study and exploitation of a relevant corpus

  • ISSN 1442-438X
  • CALL-EJ Online
  • Vol. 3, No. 2, January 2002


  • Sean Romano Maddalena
  • srm@gol.com
  • University of Ashiya, Japan

Abstract

This paper charts a corpus analysis research investigation which was conducted in response to a classroom question. The linguistic features under investigation are “used to” and “be used to”, two grammatical forms whose constructional similarity often causes problems for beginner-level students.
This intentionally limited study outlines, by way of a step-by-step approach, the practical procedures involved in the assimilation and manipulation of computer-generated data. It is hoped that novice investigators may gain some valuable insight into what even simple inquiries can offer, both to themselves as linguistic theorists and to their learners as they work towards a greater understanding of language meaning and usage.

A Brief History of Corpus Linguistics

Studies of language can be divided into two main areas: studies of structure and studies of use. Corpus analysis (CA) focuses on the second of these, studying actual language used in naturally occurring texts. Ever since Firth (1957) stated that “You shall know a word by the company it keeps”, it has been a practice in linguistics to classify words not only on the basis of their meanings, but also on the basis of their co-occurrence with other words. However, in a purely practical sense it is only in recent times that machines have given us the ability to identify these relationships in a meaningful and significant way.
From the simple listing of words by hand in the Middle Ages, to the earliest corpus-based analyses of literary styles, through to the first modern electronically readable corpus, the Brown University Corpus of American English (and its close cousins, the Lancaster-Oslo/Bergen corpus and the Kolhapur Corpus), the computer-aided analysis of vast amounts of authentic data has come a long way in a very short time. Almost half a century ago Firth (1957: 31) made the following prophetic statement: “The use of machines in linguistic analysis is now established”. John Sinclair (1991: 1) describes the evolution through the last three decades in the following way: “Thirty years ago when this research started it was considered impossible to process texts of several million words in length. Twenty years ago it was considered marginally possible but lunatic. Ten years ago it was considered quite possible but still lunatic. Today it is very popular”. This popularity has led to an increased understanding of the relationship of meaning to form as formal patterns, previously undetected, have come to light. Sinclair states again, “At the very least, the quality of linguistic evidence is going to be improved out of all recognition… It is my belief that a new understanding of the nature and structure of language will shortly be available as a result of the examination by computer of large collections of texts” (1991b: 489). Stubbs (1996) concurs: “computer-assisted analysis of texts and corpora can provide new understanding of form-meaning relations”.
It should be noted that CA involves far more than using computers for the simple counting and quantifying of linguistic features into sets of statistics. Though this may be seen as the first step in a two-stage process, it is the subsequent, qualitative analysis that provides the more revealing evidence “to propose functional interpretations explaining why the patterns exist” (Biber, Conrad & Reppen, 1998: 9). As a practical investigation, however, this paper focuses primarily on the procedures involved in obtaining and manipulating the data required to create a corpus, and while it does present some insight into possible pedagogic considerations and offer tentative conclusions based on corpus-generated evidence, its scope is intentionally limited.

Choosing a Corpus

Source, size and selection
In response to a recent classroom inquiry, the linguistic features under investigation are “used to” and “be used to”, two grammatical forms whose constructional similarity often causes problems for beginner-level students. For the purposes of this investigation I chose to use two established corpora, the Lancaster-Oslo/Bergen Corpus (LOB) of British English, established by Geoffrey Leech and Jan Svartvik, and its American counterpart, the Brown University Corpus of American English (Brown), running parallel investigations under different methodological conditions. The two corpora are very similar in design: each is drawn from some five hundred texts across a wide range of registers, giving a combined total of approximately two million words.
Size is a prime concern for successful corpus-based lexicographic research. As Biber et al. warn: “To study the meaning and use of words, we need a very large corpus — a 1-million word corpus will not provide sufficient data for many words to allow for meaningful generalizations” (1998: 30). However, with more common words in a text of this size, frequencies are generally considered to be quite reliable. At a million or so words each, I was hoping that my choice of general purpose corpora would provide enough evidence to sufficiently highlight linguistic elements for possible future pedagogic exploitation.

Methodology

As this is primarily a practical research study, I chose to conduct the investigation using a number of different methods. In the first instance, I examined the LOB corpus using a CD-ROM provided by the International Computer Archive of Modern English (ICAME), running the analysis through a software application, the Aston Text Analyser (ATA), supplied by Aston University. I also used part of the LOB corpus to examine the practical problems one might encounter in the creation of a pedagogic corpus, established corpora not always being readily available for investigation and exploitation.
As a reflection of recent advances in Internet technology, I was also interested in conducting a limited parallel study, making use of an on-line version of the Brown corpus, a free but time-restricted service provided by the University of Pennsylvania's Linguistic Data Consortium (LDC). Details of distribution and copyright restrictions pertaining to both texts are included (Appendix C).
It should be noted here that although the Brown corpus is also supplied on the ICAME CD-ROM, I chose not to access it in the traditional way, preferring instead to examine the benefits and shortcomings of locating and accessing corpora via the alternative, and increasingly popular, on-line method.

Equipment Used

The study was conducted with the aid of a generic desktop personal computer running the Windows operating system. Software support was provided by the WinATA Mark 2 text analyser, a word processor (MS-Word 97) and an Optical Character Recognition (OCR) program (Caere Omni-Page Pro 9.0) used in conjunction with a flatbed scanner.

Data Input: Scanning and OCR

Equipment and procedure

In some instances, teachers and researchers may not have access to established corpora due to resource limitations. In other cases, most notably for investigations in English for Specific Purposes (ESP), it might be necessary to manually create a specific pedagogic corpus. In creating such a corpus for use in CA, one possible means of inputting data is to scan text directly into a computer using a suitable combination of hardware and software. In order to explore the limitations of such a procedure, I used a Microtek ScanMaker X6 scanner, a low-budget flatbed model, together with Caere Omni-Page Pro 9.0 OCR software, which was supplied as part of the scanner package.
For the limited purposes of this exercise, I first selected a section of some five hundred words from my LOB corpus, cut and pasted them into a new document and saved this as a separate text file. This was then printed onto a standard sheet of A4 paper, and then scanned directly into the computer. An almost flawless text conversion is testimony to the development of OCR software in recent times. A few years ago a similar exercise might well have resulted in a bout of severe frustration, even when scanning a simple page of text. These days, more advanced programs such as Omni-Page Pro offer much greater speed, reliability and flexibility, especially when integrated into established word processing applications such as Word and WordPerfect. Carefully scanned pages of text assimilated in this way can form the basis for a ‘personal’ pedagogic corpus, to be subsequently examined by a suitable text analysis program.
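Because the scanned page was printed from a known text file, the quality of the conversion could in principle be quantified by comparing the OCR output with the original, word by word. The following is a minimal sketch of such a check, not a procedure carried out in this study. It assumes both versions are available as plain strings and uses Python's standard difflib module; the function name and the sample strings are hypothetical.

```python
import difflib


def word_accuracy(original: str, ocr_output: str) -> float:
    """Share of the original words that reappear, in order, in the OCR output."""
    source_words = original.split()
    scanned_words = ocr_output.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=scanned_words, autojunk=False)
    # Each matching block is a run of identical words found in both versions.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(source_words) if source_words else 1.0


# Hypothetical sample: the OCR program misreads one word ("used" as "usad").
original_text = "She took time getting used to the indoor lavatories"
scanned_text = "She took time getting usad to the indoor lavatories"

accuracy = word_accuracy(original_text, scanned_text)
print(f"Word-level accuracy: {accuracy:.1%}")
```

A word-level comparison is deliberately strict: a single misread character counts as a whole wrong word, which is the relevant unit when the text is later to be searched for exact strings.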

Some Points to Note

There are two significant considerations that can affect the quality of the final output from the scanning procedure. The first, and most important, is the quality and condition of the document that one wishes to scan. I was using clearly printed black text on a clean sheet of plain white paper. Highly colored, glossy, marked or even creased papers have all been known to cause problems with OCR software. The second consideration relates to the complexity of the document. As my inquiry revealed, regular text is not really a problem for this kind of application. However, when one mixes text, graphics and tables, more time needs to be spent in the setup process before attempting the conversion. I also found in this exercise that the software occasionally flagged correct words simply because they were not in the dictionary it was using.

LOB and ATA

Installation

Installation of the ATA software suite is via CD-ROM. It is important to note during the installation process that in order for the software to function correctly, all files must be extracted into the same location and not into separate folders. Correct installation creates two executable programs, ataIndex and ataInsight, which must be run separately, one after the other. The first of these, as the name suggests, creates and indexes the corpus. In the case of LOB, this entails specifying the correct path for the location of the text to be indexed and titling the project appropriately. When the indexing has been completed, it is then necessary to run the second application, ataInsight. This opens an ‘Open ATA project’ window in which the now indexed LOB text can be found. On selecting ‘OK’, the program starts its analysis of the chosen project.

Frequency and filter

The aim of my investigation is specifically to look for occurrences of “used to” within the corpus. To do this, it is first necessary to locate “used” in the ‘Word Frequency List’, which opens automatically on the left side of the screen. Selecting this entry (with ‘Collocations’ checked in the right-button mouse menu) creates a list of contexts in a right-hand window, some 181 entries in total.
Next, it is desirable to refine the list a little further using the collocation ‘Filter’ option, reducing it to those lines containing my chosen sub-string. Adding “to_” to the filter generates a final list of 178 concordances, each of which contains my target search string, “used to”. By selecting ‘Export’ from the right-button mouse menu, concordances can then be exported with relative ease from within the application and opened in a word processor, ready for tabulation (Appendix A). From a total of 1,022,828 tokens, the following frequency list is generated. Relative frequencies are out of 10,000:

Fig.1 LOB Corpus frequencies for “to”, “used” and “used to”.
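For readers unfamiliar with the normalisation, a relative frequency "out of 10,000" is simply the raw count divided by the total number of tokens and multiplied by 10,000. The sketch below illustrates the arithmetic using only the two LOB figures quoted in the text (178 concordances, 1,022,828 tokens). The function name is hypothetical, and the value is computed here rather than copied from the figure.

```python
def relative_frequency(count: int, total_tokens: int, per: int = 10_000) -> float:
    """Normalise a raw count to occurrences per `per` tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return count / total_tokens * per


LOB_TOKENS = 1_022_828  # token total reported in the text
USED_TO_COUNT = 178     # concordances remaining after the "to_" filter

lob_used_to = relative_frequency(USED_TO_COUNT, LOB_TOKENS)
print(f'"used to" in LOB: {lob_used_to:.2f} per 10,000 tokens')
```

Normalising in this way is what allows counts from corpora of slightly different sizes to be set side by side.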

Observations

Presentation, an important consideration not merely for aesthetic purposes, also demands a practical working knowledge of basic word processing operations. Ideally for beginner-level students, concordances are presented in a clear and easy-to-read tabular format, sorted alphabetically to enable the swift identification of collocation patterns (Appendix A and Appendix B).
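Where no dedicated text analyser is to hand, the same kind of alphabetically sorted concordance can be approximated with a few lines of code. The sketch below is an illustration only and does not reproduce how ATA works internally: it assumes a plain-text corpus split on whitespace, and the function names and the short sample text are hypothetical.

```python
def kwic(text: str, phrase: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match of `phrase`."""
    tokens = text.split()
    target = phrase.lower().split()
    size = len(target)
    rows = []
    for i in range(len(tokens) - size + 1):
        if [t.lower() for t in tokens[i:i + size]] == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + size:i + size + width])
            rows.append((left, " ".join(tokens[i:i + size]), right))
    return rows


def sort_by_right_context(rows):
    """Sort concordance lines alphabetically by the words following the keyword."""
    return sorted(rows, key=lambda row: row[2].lower())


# Hypothetical three-sentence sample, not taken verbatim from either corpus.
sample = ("He used to walk to the station. Clara was used to following his lead. "
          "The method was used to forecast visibility.")

sorted_rows = sort_by_right_context(kwic(sample, "used to"))
for left, keyword, right in sorted_rows:
    print(f"{left:>20}  {keyword}  {right}")
```

Sorting on the right-hand context groups lines such as "used to be ..." together, which is what makes collocation patterns easy to spot. Note that this simple tokenisation would miss a keyword with punctuation attached (for example "used to."), so a tokenised corpus with separated punctuation, as in LOB, is the safer input.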

Brown Corpus

As mentioned above, the Brown corpus is accessed through the University of Pennsylvania's LDC internet site. It offers a selection of corpora for real-time analyses, though access as a ‘guest user’ is restricted to twenty days. On acceptance of the user terms and conditions, one is invited to enter the relevant search criteria in a series of selectable fields.
An initial search returns a tagged frequency list, and generates concordances for the identified search pattern. The complete list of Brown concordances is provided in their processed form (Appendix B).
From a total of 1,189,209 tokens, the following frequency list is generated. Once again, relative frequencies are calculated out of 10,000:

Fig.2 Brown Corpus frequencies for “to”, “used” and “used to”.

Observations

Established corpora are often the culmination of a great deal of time, effort and, most significantly, money. Such investment is jealously guarded and may not, therefore, be made generally available without due consideration of cost. In some cases this may prove prohibitive to the less fortunate researcher. In this light, it can be seen that the ability to access a large on-line corpus in real-time is extremely useful for those unable to avail themselves of the more traditional resources, and also appealing to those who lack the practical wherewithal necessary for the successful exploitation of a complicated text analysis program. Such corpora also offer the added benefit of speed; a list of concordances can be generated in a matter of seconds. However, at this early stage of development the on-line corpus does not yet offer the flexibility or power of a dedicated software package, such as ATA, to sort or to filter as need dictates.

Analysis

The majority of the concordances in LOB are taken up with “used to” employed to describe past situations and events. There is a visible tendency within the list to collocate with the verb “to be” and also with other common verbs:


    • as fresh as it used to be, though an
    • you herself what she used to be.
    • But then I used to be a racing
    • reading," wrote Francis Williams, "used to be a Socialist

    The corpus provides twenty-eight instances of “be used to” meaning “be accustomed to”. The tendency is for the item to collocate with a noun or a verb, notably the gerund. Of the total number, only eleven actually occur with the gerund, which is the collocate most commonly highlighted in beginner-level textbooks. Textbooks also tend to focus on the gerund occurring after the target form:
    • time before I got used to calling them portholes.
    • Clara was used to following his lead
    • seemed to have been used to seeing couples engaged

    whereas LOB offers examples of the gerund occupying a position before the target form:
    • a bit of getting used to
    • She took time getting used to the indoor lavatories

    And a single instance of a noun coming between the two:
    • garage, but he was used to Grant taking his

    A further significant observation is that more than half of these concordances demonstrate collocations with the verb “get”:
    • You'll have to get used to my bad morning
    • heavy, but one got used to this

    Though not the focus of this particular exercise, the list also provides some examples of the target form performing a third linguistic function, the passive voice:
    • descriptions can also be used to refer to performances
    • ratio decidendi}is normally used to refer to some
    • beggars, a term often used to describe the population,
    • ferromagnetic spinel is sometimes used to describe those ferrites

    With Brown, as with LOB above, “used to” describing past events tends to collocate with the verb “to be” and other common verbs:
    • eem high, but they used to be even higher,
    • spe said, This soil used to be like that
    • ard roll. <s> This used to be part of

    Also present, as noted in LOB, are instances of “used to” employed in the passive voice:
    • ma. The method used to scan the eye
    • I rand, IOCSIXG, is used to specify the second

    The Brown corpus offers twelve examples of “be used to” meaning “be accustomed to”, fewer than half of the total number present in LOB. Of these, only five collocate with the gerund:
    • ke a little getting used to — not because it
    • ur people have been used to accepting things as
    • that must have been used to booming, `` and th
    • he governor was not used to having his integrit
    • jealous. <s> He's, used to me bringing home

    and only two of the twelve co-occur with the verb “get”:
    • ke a little getting used to — not because it
    • little time to get, used to. After a
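The counts reported in this section were arrived at by reading through the concordance lines, but a rough first tally of the collocates on either side of “used to” can be automated before the manual check. The sketch below is one possible approach under stated assumptions: each concordance is held as a (left context, right context) pair, the function name and labels are hypothetical, and the six sample lines are taken from the LOB examples quoted above.

```python
from collections import Counter

GET_FORMS = {"get", "gets", "got", "getting", "gotten"}


def collocate_profile(rows):
    """Tally simple left- and right-hand collocates of the target form."""
    profile = Counter()
    for left, right in rows:
        left_words = left.lower().split()
        right_words = right.lower().split()
        if left_words and left_words[-1] in GET_FORMS:
            profile["preceded by 'get'"] += 1
        if right_words and right_words[0] == "be":
            profile["followed by 'be'"] += 1
        # Crude test: any word ending in -ing is counted, gerund or not.
        if right_words and right_words[0].endswith("ing"):
            profile["followed by -ing form"] += 1
    return profile


lob_sample = [
    ("time before I got", "calling them portholes ."),
    ("Clara was", "following his lead"),
    ("You'll have to get", "my bad morning"),
    ("heavy , but one got", "this"),
    ("as fresh as it", "be , though an"),
    ("But then I", "be a racing"),
]

profile = collocate_profile(lob_sample)
for label, count in profile.most_common():
    print(f"{label}: {count}")
```

Such a tally is only a starting point. The -ing test would wrongly count a line such as “used to sing”, and a gerund that precedes the target form, or one separated from it by a noun, would be missed, so each line still needs to be read in context.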

    Possible Pedagogic Applications

    In the classroom, concordances produced through the analysis of a suitable corpus can provide valuable data for the testing of existing grammatical models and practical material for the production of cloze exercises. Closer examination can also reveal patterns and constructions that may not be covered in prescribed textbooks.
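As an illustration of how exported concordance lines might be turned into cloze material, the sketch below blanks out the target form in each line. It is a minimal example under stated assumptions rather than a tested classroom tool: the function name and blank marker are hypothetical, and the sample lines are taken from the LOB concordances quoted earlier.

```python
import re


def make_cloze(line: str, target: str = "used to", blank: str = "_____") -> str:
    """Replace each whole-word occurrence of `target` in a line with a blank."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(blank, line)


concordance_lines = [
    "She took time getting used to the indoor lavatories",
    "But then I used to be a racing",
]

cloze_items = [make_cloze(line) for line in concordance_lines]
for number, item in enumerate(cloze_items, start=1):
    print(f"{number}. {item}")
```

Blanking the target form is only one option. For practice in distinguishing the two constructions, it may be more useful to blank the word that follows, so that learners must choose between a base verb and a gerund or noun.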
    The initial intent of this study was to examine the differences in usage between “used to” and “be used to”. My learners do not have a significant problem with the former, but do express confusion when attempting to differentiate it from the latter. My institution's current choice of text only instructs in the use of “be used to” co-occurring with the gerund and, consequently, my students have only been exposed to this construction in their English classes. However, the majority of these concordances in Brown and LOB occur with no gerund at all, a point worthy of highlighting in the classroom. Though different in meaning, the number of cases of “get used to” provided by the corpora, most prominently LOB, may be seen as noteworthy and also deserving of my students' attention, as this particular construction is not covered in the students' textbook at all. A practical pedagogic approach to both of these issues would be to expose my students to the corpus-generated data as part of a series of carefully coordinated lessons. Through the insights I have gained in the course of this particular study, my eventual aim would be to bring CA directly into the classroom, possibly as part of the school's regular computer studies classes, and allow my students to join the investigation as part of a hands-on practical exercise.
    However, to add a note of caution, as my own small investigation reveals, there are significant differences in both frequency and usage to be found even across two very ‘similar’ corpora. It is important therefore to make only tentative inferences regarding grammatical rules or patterns of use and to acknowledge the limitations of dealing with such a small sample of data. A future piece of research conducted on a much larger text might allow for some more definite conclusions to be made.
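The scale of the differences between the two corpora is easier to judge once the raw counts are normalised. The sketch below sets the figures reported in this paper side by side: the token totals, the number of “be used to” lines, and the number of those lines that occur with a gerund. The function and variable names are hypothetical, and no significance test is attempted on so small a sample.

```python
def per_million(count: int, total_tokens: int) -> float:
    """Occurrences per million tokens."""
    return count / total_tokens * 1_000_000


# Figures quoted in the text:
# (corpus tokens, "be used to" lines, of which with a gerund)
corpora = {
    "LOB": (1_022_828, 28, 11),
    "Brown": (1_189_209, 12, 5),
}

summary = {}
for name, (tokens, accustomed, with_gerund) in corpora.items():
    summary[name] = {
        "per_million": per_million(accustomed, tokens),
        "gerund_share": with_gerund / accustomed,
    }
    print(f"{name}: {summary[name]['per_million']:.1f} per million tokens, "
          f"{summary[name]['gerund_share']:.0%} with a gerund")
```

On these figures the construction is markedly more frequent in LOB than in Brown, while the share of lines with a gerund is similar in both, which is consistent with the caution urged above about drawing firm conclusions from counts this small.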
    A further possible pedagogic option, requiring an extension of this study, would be to heed the advice of Willis & Willis (1996) and Peacock (1997: 152) to produce a set of authentic materials: “materials which are used in genuine communication in the real world” (Wong, Kwok & Choi, 1995: 318), taken from a spoken, rather than written, corpus and to investigate specifically any increased signs of motivation with my less-conscientious learners.
    It is perhaps a fitting conclusion to note that in the course of writing this paper a further development in the evolution of computational linguistics and the internet is reported: ICAME is now the latest in a growing number of institutions offering on-line access to all of its corpora, in this case to registered users of its commercially available CD-ROM. It seems likely that such innovations, offering increased levels of accessibility to an ever-growing body of linguistic data, will continue into the foreseeable future.

    References

    • Biber, D., Conrad, S., & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.
    • Brown University Corpus of American English. University of Pennsylvania, Linguistic Data Consortium: http://www.ldc.upenn.edu/
    • Firth, J. R. (1957). A synopsis of linguistic theory. Studies in linguistic analysis. Oxford: Oxford University Press.
    • Lancaster-Oslo/Bergen corpus (1961). International Computer Archive of Modern English. Bergen, Norway.
    • Peacock, M. (1997). The effect of authentic materials on the motivation of EFL learners. ELTJ, 51(2), 144-156.
    • Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
    • Stubbs, M. (1996). Text and Corpus Analysis. London: Blackwell.
    • Willis, J. & Willis, D. (1996). Consciousness-raising activities. In Willis, D. & Willis, J. (Eds.), Challenge and Change in Language Teaching. London: Heinemann.
    • Wong, V., Kwok, P. & Choi, N. (1995). The use of authentic materials at tertiary level. ELTJ, 49(4), 318-322.

    Author

    Sean Maddalena holds a first degree in Law, a Diploma in TEFL and an MSc in TESOL from Aston University. Originally from the United Kingdom, he has made his home in Japan since 1989 and is currently employed by Ashiya University. His specific research interests include Course and Syllabus Design and Computational Linguistics.

    Appendix A


    a bit of getting used to .
    plane can only be used to a limited extent
    of the man habitually used to a shoulder-holster .
    such computers can be used to advantage when a
    Gissing used to ask ~ * ' Has he
    affluent society should be used to assist the less
    Rolled barley is used to balance grass or
    as fresh as it used to be , though an
    you herself what she used to be .
    man myself though : I used to be a { 0G.P . }
    But then I used to be a racing
    reading , " wrote Francis Williams , " used to be a Socialist
    done by administrative act used to be accomplished in
    the subject Social Psychology used to be called Home-making
    of the May song used to be current in
    It used to be fancier , but
    At one time " mind * * " used to be identified with "
    of their larger cars used to be made available
    her hair , it never used to be quite that
    This lesson used to be read only
    Sometimes that pleasant Citroen used to be subject to
    Harry of the joint used to be the barman
    There used to be three separate
    I was younger I used to be what is
    Like he used to be years ago . . .
    three feet long but used to being handled , in
    of the gold fillets used to bind up the
    Miniature cedar trees are used to block out the
    technical school ) should be used to broaden the youngsters '
    British sources have been used to calculate the effective
    time before I got used to calling them portholes .
    I always used to clean my rifle
    He used to come every day
    He used to come to Pierre's
    remember a woman who used to come to see
    at Saintes , has been used to complete the drawing
    have been or are used to control impurity build
    is what bedizened boys used to dance before Mogul
    its phrases , especially those used to describe a visit
    Kunst was used to describe certain branches
    with the conventional equations used to describe fluxes in
    unit , can be properly used to describe soils in
    a root that is used to describe the herding
    as the wave functions used to describe the motion
    equation can indeed be used to describe the motion .
    however , they may be used to describe the motions
    beggars , a term often used to describe the population ,
    ferromagnetic spinel is sometimes used to describe those ferrites
    method of measurement was used to determine accurately the
    year group was then used to determine what would
    his Cambridge days , he used to display a corresponding
    elaborate dresses than they used to do .
    Mould many years back used to do .
    People used to do all their
    strain , the two being used to draw true stress /
    young the Royal Navy used to drink it before
    He used to drink the cheap ,
    that report has been used to estimate the theoretical
    diametrically opposed contacts were used to facilitate the observation
    gouge , and the file used to finish off .
    the former crop being used to finish off the
    Clara was used to following his lead ,
    The method was used to forecast visibility ( as
    concrete tube sections being used to form the sump
    smoothing plane can be used to form the taper .
    Bank years ago we used to get good hauls , 12
    song , told me : We used to get up at
    This solution may be used to give the contribution
    those places where we used to go .
    much as Cecil Sharp used to go about in
    She used to go about the
    garage , but he was used to Grant taking his
    I used to hate Creedy , when
    for a drink he used to have his grouse .
    The Caxtons used to have their holidays
    told me " I always used to hear a lot
    We used to hear talk about
    took time to become used to hearing so much
    household possessions may be used to help with the
    Apparently he used to hide it in
    they may be fruitfully used to His Glory .
    and these can be used to illustrate the type
    overclothe them as they used to in the old
    The term quasi-classical is used to indicate that their
    growth equilibrium " paths , are used to investigate the stability
    man , if you aren't used to it , * * ' he heard
    You'll get used to it , adorable baby .
    that we should get used to it .
    I never got used to its travel-film colours
    Two methods can be used to join the crochet
    differences between jobs be used to justify differences in
    a young man , we used to keep strictly to
    to meet people I used to know , to see
    electric effect can be used to launch ultrasonic waves
    I used to lie awake planning
    a counter-irritant almost I used to listen of nights
    Marc Chagall used to live here and
    Then that's why * - " " He used to live in Tangier , "
    They used to look * - and some
    of an elephant , was used to make a cake
    Some separated lead-210 was used to make reference standards
    crochet lace can be used to make tablecloths , traycloths
    provision which was now used to make the { 0T.E .
    ancient Britons , I believe , used to make water hot
    as it is now used to mark a paragraph
    Section the term was used to mean something like
    George used to mix 100 stone of
    junior to Humbert , who used to mock him affectionately
    You'll have to get used to my bad morning
    gauge can now be used to nick in the
    three following winters were used to obtain an independent
    He used to organise film shows
    which can then be used to perform an operation .
    and devices to be used to perform the various
    I used to play about in
    I used to play rugger , * * ' said
    lead carrier solution is used to prepare the reference
    how Alexander the Great used to recline and transact
    descriptions can also be used to refer to performances
    ratio decidendi } is normally used to refer to some
    it may have been used to relate Christ's healing
    migre * ? 2s , who notoriously used to repair to the
    she said chattily , I used to ride a bicycle .
    and personality which journalists used to ridicule , can be
    the gate the cockerel used to run to meet
    for you fellows , * * ' he used to say , you can
    Laughable , they used to say .
    He used to say : ^ Have whatever
    Of Kitchener he used to say with humorous
    reminiscent of what we used to see St .
    seemed to have been used to seeing couples engaged
    embarrassment if she is used to seeing her mother
    that force should be used to settle this problem .
    the May carol he used to sing , with his
    me the one she used to sing in Kimbolton
    a shaped rubber is used to smooth the hollow
    was young schoolboy I used to sneak off to
    She used to solve all the
    the clinical weekends he used to spend with her .
    applied , and every means used to stop the train ,
    in contrasting tones were used to strengthen garments at
    model which may be used to study both the
    He used to stump round the
    possibility of power being used to supplement hand tools .
    I used to take the small
    and colleague , Campbell Dixon , used to tell of a
    The straight-edge can be used to test the straightness
    is bought , can be used to the best advantage .
    at ( B ) . A mallet used to the chisel is
    become ( 1 ) tired , or ( 2 ) more used to the disturbance .
    Soho , to get me used to the food , he
    might as well get used to the idea .
    they very quickly get used to the idea of
    She took time getting used to the indoor lavatories
    They're used to the snatch racket .
    that most people get used to them .
    Jane was used to these sudden exigencies
    or chieftain to get used to these trimmings because
    to tinsel compliments , we used to think him unworldly ,
    in an Embassy * - I used to think it was
    heavy , but one got used to this .
    You are not yet used to this sort of
    decorative kale are conveniently used to tone in with
    horses ; they had been used to trains since they
    The brush contacts were used to trigger off a
    He often used to try to imagine
    His friends used to try to persuade
    friend , William James , who used to urge that the
    in London that Jones used to use in the
    slaves * - everything he was used to using while he
    a literary province I used to visit fairly often ;
    She used to walk straight to
    He used to walk to the
    page , would have been used to weigh bales of
    They could be used to weigh several sacks
    its simplest form it used to work in the
    they are a team used to working together , they
    like that she had used to write to me .

    Appendix B


    ke a little getting
    used to -- not because it
    iling teasing as he
    used to . <p> <s> `` Husky
    from it that she
    used to . <p> <s> `` You
    little time to get
    used to . <s> After a
    ur people have been
    used to accepting things as
    a new melody is
    used to accompany his narra
    repetitious The logical scheme
    used to accomplish the form
    residual hese inquiries were
    used to adjust compilations
    uestions . <s>I 'm
    used to all three , but
    herse one hebephrenic man
    used to annoy me , month
    seven-iron shot he
    used to approach the green
    s> They could be
    used to attack a nation '
    platform and can be
    used to automatically hold
    citiz--uglier than you
    used to be , and you
    ss glorious than it
    used to be , it is
    nistered here as it
    used to be , with unleavene
    or less than it
    used to be ? ? <p> <s>
    eem high , but they
    used to be even higher '' ,
    spe said , This soil
    used to be like that
    ard roll . <s> This
    used to be part of
    as e Catskills , which
    used to be the summer
    that must have been
    used to booming , `` and th
    ese profiles can be
    used to calculate a tempera
    feeli ransports that were
    used to carry Communist age
    the mails were then
    used to carry it out '' . <
    tional codes can be
    used to challenge and count
    and d Margaret recall ,
    used to characterize her as
    les of crystals are
    used to classify and identi
    of materials can be
    used to construct a satisfa
    cattle of thousand spectators
    used to crowd it in
    holes and can be
    used to cut exact-size disc
    the words he had
    used to defend Cromwell . <
    grea emical methods were
    used to demonstrate the ren
    K factor , a term
    used to denote the rate
    s> Mines can be
    used to deny access to
    elastic resonance shifts is
    used to derive a general
    was a Spanish word
    used to describe cattle of
    s ,sometimes it is
    used to describe felt human
    ind words travelers
    used to describe Little Roc
    prbody temperature is
    used to describe the radiat
    e aircraft could be
    used to destroy other mobil
    ese sound waves are
    used to detect submarines ,
    the the anonymous Woman
    used to do , and he
    each time as he
    used to do . <s> When
    second aerated lagoons be
    used to eliminate the probl
    h tiles , marble are
    used to emphasize the feeli
    ve operation EQU is
    used to equate symbolic nam
    d transom which was
    used to fasten them to
    a satisfa lf-unloading wagons
    used to fill silos spreads
    tten 2 B filter was
    used to filter off residual
    er last week ,I
    used to follow Williams eve
    power which can be
    used to frustrate the citiz
    atement may also be
    used to generate an RDW
    old days when `` we
    used to get the seamen
    af A hebephrenic man
    used to give a repetitious
    was another . <s> I
    used to go with Watson
    mulated that can be
    used to good advantage . <p
    eel lonely , and we
    used to hang a sign
    aps as the cave-men
    used to have in the
    he governor was not
    used to having his integrit
    and had already become
    used to Hesperus ' snapping
    eem strange to ears
    used to hillbilly and jazz
    and he was not
    used to horseback . <s> Now
    ngs Thorpe, can be
    used to illustrate another
    vocatio pleading cannot be
    used to impose unnecessary
    nk together like we
    used to in the old
    the progr `` technology '' is
    used to include any and
    of time is merely
    used to increase the realis
    mobil rrently , marina is
    used to indicate a municipa
    w seldom they did :
    used to it , probably . <s>
    n tactics have been
    used to justify like tactic
    spreads Computers are being
    used to keep branch invento
    the new jail , we
    used to keep prisoners in
    ng cover , could be
    used to keep the wastes
    the eye . <s> We
    used to kid him by
    ny ? ? <s> He never
    used to like any hot
    cereal aining appliance is
    used to lock them in
    c. <s> The President
    used to look at it
    by the same method
    used to look up a
    ith , Styka . <s> I
    used to love this country
    he coconut palm are
    used to make candles in
    as urposes -- also are
    used to make soaps , deterg
    of public places that
    used to make the Jew
    zon apabilities must be
    used to maximum advantage t
    jealous . <s> He 's
    used to me bringing home
    count mimesis '' is here
    used to mean the recalling
    if it could be
    used to measure the elastic
    s> Sonar can be
    used to measure the thickne
    radiat ed thermocouple was
    used to measure the upstrea
    aratus will also be
    used to measure transition
    s steel screws were
    used to minimize corrosion
    The DA statement is
    used to name and define
    The DC statement is
    used to name and enter
    sample ; e bio-assay methods
    used to obtain them . <s>
    tient of mine , who
    used to often seclude herse
    s> yesterday .<s> You
    used to paint in them ,
    ly state funds were
    used to pay for the
    as a child I
    used to play '' . <s> He
    he corner where you
    used to play when you
    very summer . <s> I
    used to play with the
    out ''surpluses had been
    used to provide a private
    ce forces have been
    used to provide defense zon
    she ed aluminum plate ,
    used to provide the drying
    Miss Giles always
    used to refer to her
    most of what we
    used to regard as the
    ntic up there , she
    used to say , with the
    of my ewish intellectuals
    used to say . <p> <s>
    se by instinct , he
    used to say : such places
    The party that won
    used to say something about
    ma . <s> The method
    used to scan the eye
    S statement must be
    used to select the major
    stem . <s> DIOCS is
    used to select the major
    b '' . <s> It is
    used to separate two or
    s> The symbol is
    used to separate two or
    foam and can be
    used to slit continuous she
    me rand , IOCSIXF , is
    used to specify the first
    I rand , IOCSIXG , is
    used to specify the second
    upstrea equency starter was
    used to start the arc . <
    erb garden was also
    used to stop bleeding , and
    a lock ,which is
    used to store cumulative re
    Throu was constructed and
    used to study transition pr
    corrosion e been successfully
    used to suggest ways to
    than to an American
    used to summers in New
    pirical data can be
    used to support whatever pr
    sort of thing that
    used to take place in
    e evening . <s> She
    used to tell me , `` When
    South nt this opportunity
    used to tell them about
    ygous Af cells were
    used to test each sample ;
    invento <s> Where Americans
    used to think of a
    unt of a machine-family
    used to this very day
    he enemy-Jew can be
    used to transform the ordin
    i er , Model 565 , is
    used to transport the boat
    was a trick they
    used to try and conceal
    San Juan , but I
    used to work on a

    Appendix C

    Copyrights and distribution:

    LOB Corpus:

    The corpus and accompanying manual are available at cost to bona fide researchers through the International Computer Archive of Modern English (ICAME), at the Norwegian Computing Centre for the Humanities, Bergen, Norway.
    The following restrictions on the use of the material must be strictly observed:
    • No copies of the corpus, or parts of the corpus, are to be distributed under any circumstances without the written permission of ICAME.
    • Print-outs of the corpus, or parts thereof, are to be used for bona fide research of a non-profit nature. Holders of copies of the corpus may not reproduce any texts, or parts of texts, for any purpose other than scholarly research without obtaining the written permission of the individual copyright holders, as listed in the manual accompanying the corpus.
    • Commercial publishers and other non-academic organizations wishing to make use of part or all of the corpus or a print-out thereof must obtain permission from all the individual copyright holders involved.

    Brown Corpus:

    The Linguistic Data Consortium grants to you a license to use this data subject to the following understandings, terms and conditions:
    1. Permitted Uses.
      • This data may only be used for linguistic research.
      • Small excerpts of text or audio data from LDC-Online materials may be displayed to others or published in a scientific or technical context, solely for the purpose of describing the research and related issues. Statistics and other summaries of LDC-Online materials may also be published in the same context. Except for such publication of small excerpts or statistical summaries in scientific or technical works, neither LDC-Online materials themselves, nor access to them, may be sold or transferred to others.
    2. Access by Individuals.
      • To access this data, you must be a staff member, consultant, or individual providing service or doing research at an organization that is a member of the LDC, and you must agree to this user agreement and its provisions. You must terminate your access when these conditions no longer apply.
    3. Copyright.
      • Except as specifically permitted above, the display, reproduction, transmission, distribution or publication of these databases is prohibited.
      • Violations of the copyright restrictions on the data may result in legal liability.
    source: http://callej.org/

