Discussion:
[basex-talk] Full-text lemmatizing and xml:lang
Kristian Kankainen
2017-06-27 14:57:09 UTC
Hello

I have documents with text in several languages. When creating a
database in BaseX, I can choose only *one* language for stemming in the
full-text search index. Is there a way for BaseX to lemmatize according
to each element's xml:lang attribute?

Best regards
Kristian K
Xavier-Laurent SALVADOR
2017-06-27 15:19:38 UTC
Hi Kristian,

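This snippet can be used to create databases automatically according to the
xml:lang attribute:

let $dir := '/Users/me/myDesktop/'
(: file:list returns names relative to $dir; keep only .xml files :)
for $file in file:list($dir)[matches(., '\.xml$')]
(: assumes that the root element is a div carrying xml:lang :)
let $flag := data(doc($dir || $file)/div/@xml:lang)
return
  (: one database per language code; assumes one file per language :)
  db:create('DB-' || $flag, $dir || $file, (),
    map { 'ftindex': true(), 'language': $flag })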

Alternatively, you can call ft:tokenize on your strings inside the query and
pass map { 'language': $flag } as options.
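
For illustration, a minimal sketch of the ft:tokenize approach. The two div
elements are made-up sample data, and the 'stemming' option is an addition
that is not part of the suggestion above; it assumes that a stemmer for each
language is available in the BaseX installation.

```xquery
(: Tokenize each string with the stemmer that matches its xml:lang value. :)
for $div in (
  <div xml:lang="en">houses and gardens</div>,
  <div xml:lang="de">Häuser und Gärten</div>
)
let $flag := string($div/@xml:lang)
return ft:tokenize(
  string($div),
  map { 'language': $flag, 'stemming': true() }
)
```

This only normalizes strings at query time; it does not change what is stored
in the full-text index of a database.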

Hope I understood the problem :) Else return 'sorry'
--
*This email may contain material for the sole use of the intended
recipient. Any forwarding without express permission is prohibited. If you
are not the intended recipient, please contact the sender and delete all
copies*.
Christian Grün
2017-06-27 18:49:05 UTC
Hi Kristian,

It is currently not possible to work with different languages in a
single database. This is mostly because all normalized tokens will end
up in the same internal index, and it would be a lot of effort to
diversify this software behavior.

As Xavier pointed out (thanks!), the best way indeed is to create
different databases, one per language. The following example has been
inspired by Xavier’s proposal; it groups all files by their language
and adopts the language in the name of the database:

for $path-group in file:children('input-dir')
where ends-with($path-group, '.xml')
(: parse each file and use its first xml:lang value as grouping key :)
group by $lang := (doc($path-group)//@xml:lang)[1]
return db:create(
  'db-' || $lang,
  $path-group,
  (),
  map { 'ftindex': true(), 'language': $lang }
)

Hope this helps,
Christian




Xavier-Laurent SALVADOR
2017-06-28 06:45:15 UTC
Hi,

After reading Christian's answer ( :-) ), I thought it could be interesting
to sort your docs according to @xml:lang and create a new DB for each
language next to your corpus:

----------------------------------
let $files := file:children('input-dir')[matches(., '\.xml$')]
(: bind the language to a variable: inside a predicate, "." would refer
   to the tested element and not to the language :)
for $lang in distinct-values($files ! doc(.)//@xml:lang)
return db:create(
  'db-' || $lang,
  <root xml:lang="{$lang}">
  {
    for $file in $files
    return
      <text src='{$file}'>{doc($file)//*[@xml:lang = $lang]//text()}</text>
  }
  </root>,
  "myfile",
  map { 'ftindex': true(), 'language': $lang }
)
----------------------------------
Kristian Kankainen
2017-06-30 21:26:56 UTC
Hello

Sorry for the slow reply; being a full-time father of two kids
is my only excuse.

Thank you for the enlightening answers. At first, creating a separate
database felt wrong and stupid, but after a while it felt just right: it
helps to organize the elements of different languages via aggregation
instead of composition.

Here is what I came up with:

(:~
 : This function takes a list of database names and, optionally, a list of
 : language codes.
 : It creates a separate full-text indexed database for lemmatized searching
 : of each language contained in the original database.
 : If the list of language codes is empty, all values of xml:lang
 : found in the database are used.
 : The full-text databases are named 'dbname-ft-langcode'.
 : Another function normalizes the texts, removes duplicate entries and
 : inserts xml:id attributes.
 :)
declare updating function keeleleek:create-ft-indices-for-each-lang(
  $db-names   as xs:string*,
  $lang-codes as xs:string*
) {
  for $db-name in $db-names
  let $langs :=
    if (exists($lang-codes))
    then $lang-codes
    else distinct-values(db:open($db-name)//@xml:lang)
  for $lang in $langs
  let $lang-group := db:open($db-name)//*[@xml:lang = $lang]
  let $ft-db-name := concat($db-name, '-ft-', $lang)

  (: create full-text db for each language :)
  return db:create(
    $ft-db-name,
    <texts>{$lang-group}</texts>,
    $ft-db-name,
    map { 'ftindex': true(), 'language': $lang }
  )
};
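
A database created by this function could then be searched roughly as
follows. This is only a sketch: the database name 'dictionary-ft-de' and the
search term are placeholders that follow the 'dbname-ft-langcode' naming
scheme. Because db:create is an updating function, the new databases only
exist once the creating query has finished, so the search has to run as a
separate query.

```xquery
(: Run as a separate query, after the creating query has finished.
   'dictionary-ft-de' and the term 'Haus' are placeholder values. :)
for $hit in db:open('dictionary-ft-de')//*[text() contains text 'Haus'
  using language 'de'
  using stemming]
return $hit
```

As far as I understand, the full-text index is only used when the query
options match the options the index was built with, so it may be necessary
to add 'stemming': true() to the options map when creating the database.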

Cheers
Kristian K
Lizzi, Vincent
2017-07-01 00:29:50 UTC
Kristian,

Out of curiosity, how are you linking the normalized texts in the -ft- database to the source documents? Is keeping a reference from the indexed text back to the source document a requirement in your application?

Thanks,
Vincent

Kristian Kankainen
2017-07-02 09:23:02 UTC
Perhaps a proposal below.
Post by Christian Grün
It is currently not possible to work with different languages in a
single database. This is mostly because all normalized tokens will end
up in the same internal index, and it would be a lot of effort to
diversify this software behavior.
How does BaseX behave if the database content is in many different
languages and is correctly marked up with xml:lang attributes? Does the
full-text index consider this information and apply full-text indexing
only to elements with a matching language?

As a simple illustration: will the following code create a full-text
index only for the Russian text, or for both the Russian and the
English text?

db:create(
  'db-ft-ru',
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>,
  'texts',
  map { 'ftindex': true(), 'language': 'ru' }
)

If BaseX does create the full-text index for both languages (the English
entries would be useless noise), I would propose simply filtering the
elements by xml:lang according to the language given in the options map.
This should be simpler to implement than diversifying the index, as
discussed by Christian.

Best regards
Kristian K
Christian Grün
2017-07-02 09:36:36 UTC
Hi Kristian,

Right now, xml:lang attributes are completely ignored when indexing
full-text. It’s an interesting idea to exclude texts that are marked
with languages different to the one that is currently applied; I will
think about it.

However, I should have mentioned that the language option is mostly
irrelevant unless you use stemmers. Tokenization is pretty much the
same for Western texts, so searches like the following one…

'Добрый ДЕНЬ!' contains text 'день'
using language 'en'

…will still give you the expected result. To some extent, this also
applies to Arabic texts:

'يوم سعيد' contains text 'يوم'
using language 'en'

Things are definitely different if you work with Japanese or Chinese
texts. The following query yields false:

'今日は' contains text '今'
using language 'en'

For more information on Japanese tokenization, see Toshio HIRAI’s
article in our wiki [1].
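
To see where the language option does matter, the same comparison can be
run with stemming enabled. This is a small sketch with made-up strings; the
results depend on the stemmers that are available, so none are stated here.

```xquery
(: With stemming, the language decides which stemmer normalizes the tokens. :)
('houses' contains text 'house' using stemming using language 'en'),
('houses' contains text 'house' using stemming using language 'de')
```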

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Full-Text:_Japanese
Kristian Kankainen
2017-07-03 12:02:37 UTC
Hi Christian,

To refine the proposal: it would be great if the full-text index could
be set up to consider xml:lang attributes in the following way:

* If STEMMING is set to true, the input to the stemmer should be
filtered by matching xml:lang against the LANGUAGE option. Text that is
sent to the tokenizer could be left as is and not be filtered by
LANGUAGE (see next point).

* If STEMMING is set to false, I agree with you that the general
strategy for tokenization is okay. For correctness, though, it could
still be extended to exclude all scripts that don't follow
Western-centric tokenization algorithms.

* As for the DIACRITICS sensitivity option, what is given by Unicode
and the collation used by the query is probably good enough.
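
Until something like this exists, the effect of the first point can be
approximated at query time with the standard lang() function. The following
is only a sketch with made-up sample data; it filters the candidates by
their effective xml:lang before the stemmed comparison is applied, and it
does not change how the index itself is built.

```xquery
(: Made-up sample data with one German and one English element. :)
let $doc :=
  <xml>
    <div xml:lang="de">Häuser</div>
    <div xml:lang="en">houses</div>
  </xml>
(: lang('de') keeps only the nodes whose effective xml:lang is German,
   so the German stemmer is never applied to the English text. :)
return $doc//div[lang('de')][text() contains text 'Haus'
  using language 'de'
  using stemming]
```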

What do you think?

Best regards
Kristian K
Christian Grün
2017-07-03 16:50:28 UTC
Post by Kristian Kankainen
* If STEMMING is set to true, then the input to the stemmer should be
filtered by matching the xml:lang and the LANGUAGE option. Text that is sent
to the tokenizer could be left as is and not be filtered by matching
LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming
step to the chosen language, right?

To give an example:

<xml>
<div xml:lang='de'>Häuser</div>
<div xml:lang='en'>houses</div>
</xml>

If stemming is enabled, and if language is 'de', the index would
include the two terms 'Haus' (stemmed German form) and 'Houses'
(original English form).

The query…

//div[text() contains text { "houses","Häuser" }
using language 'de'
using stemming
]

…would only return the German div element (as the German stemmer rewrites
'Häuser' to 'Haus' and 'houses' to 'hou').
Kristian Kankainen
2017-07-04 06:32:46 UTC
Yes, you are correct.

During index building, only <div xml:lang='de'>Häuser</div> is
lemmatized, thus

//div[text() contains text { "houses","Häuser" }
using language 'de'
using stemming
]

returns only the element with Häuser. But a query without stemming and
language:

//div[text() contains text { "houses","Häuser" }]

would return both elements.

Best regards
Kristian K
Christian Grün
2017-07-04 08:07:30 UTC
Thanks. I’ll keep this proposal in mind, and think about further
implications. If we decided one day to make the full-text index
updatable (which would be a nice feature, but a lot of work), we would
probably need to reindex sub-trees with modified language attributes.



Kristian Kankainen
2017-07-01 07:56:00 UTC
Xavier-Laurent SALVADOR
2017-07-01 08:08:15 UTC
Hello Guys,

A reference from the indexed text back to the source document should be
kept globally in an @src attribute (for example), and it can also be
recovered with a term-by-term full-text query, so:

(: placeholder: some sentence of 2 or more words found in
   db:open('-ft-lang') :)
let $pInTargetFtLang := 'some sentence'
let $refIdInSource :=
  doc('source.xml')//*[. contains text { $pInTargetFtLang } all words ordered]/id
return $refIdInSource

should retrieve the original path. This is not "wrong and
stupid" ;-); it is a standard way of building fragmented linguistic
corpora in order to reassemble them later in a REST app. A corpus is not
just one database: it is a set of databases that you have to combine for
client use and display.

br,
x
Post by Kristian Kankainen
Hello.
It's a dictionary, and words in the dictionary reference the texts by
simple identifiers. The texts are used as examples and can be referenced by
many different words. Thus, no explicit bookkeeping is kept about which
words use which texts as examples; this is done implicitly, and each
text reference (from the dictionary to -ft-) also holds much extra
information, such as the quality of the example text with respect to the
word. Statistics and summaries of this data are produced by separate
queries and are not held explicitly in either database.
Not sure whether I answered your question?
BR
Kristian K