---
output: md_document
---

# mwiR 0.8.0 (Beta): The R Package of the My Web Intelligence Project

<!-- badges: start -->

<!-- badges: end -->

## Purpose of My Web Intelligence

**Context and Objectives**

My Web Intelligence (MWI) is a project designed to meet the growing need for tools and methodologies in the field of digital methods in social sciences and information and communication sciences (ICS). The main objective is to map the digital ecosystem to identify key actors, assess their influence, and analyze their discourses and interactions. This project addresses the increasing centrality of digital information and interactions in various fields, including health, politics, culture, and beyond.

### About the Author

**Amar LAKEL**

Amar Lakel is a researcher in information and communication sciences, specializing in digital methods applied to social studies. He is currently a member of the MICA laboratory (Mediation, Information, Communication, Arts) at the University of Bordeaux Montaigne. His work focuses on the analysis of online discourse, mapping digital ecosystems, and the impact of digital technologies on social and cultural practices.

#### Online Profiles

-   **MICA Labo**: [MICA Labo Profile](https://mica.u-bordeaux-montaigne.fr/amar-lakel/)
-   **Google Scholar**: [Google Scholar Profile](https://scholar.google.com/citations?user=hqquhfwAAAAJ)
-   **ORCID**: [ORCID Profile](https://orcid.org/0000-0002-1234-5678)
-   **ResearchGate**: [ResearchGate Profile](https://www.researchgate.net/profile/Amar_Lakel)
-   **Academia**: [Academia Profile](https://univ-bordeaux.academia.edu/AmarLakel)
-   **Twitter**: [Twitter MyWebIntel Profile](https://twitter.com/mywebintel)
-   **LinkedIn**: [LinkedIn Profile](https://www.linkedin.com/in/amar-lakel-123456789/)

## Methodology

**Research Protocol**

The research protocol of MWI relies on a combination of quantitative and qualitative methods:

1.  **Data Extraction and Archiving**: Using crawl technologies to collect data from the web.
2.  **Data Qualification and Annotation**: Applying algorithms to analyze, classify, and annotate the data.
3.  **Data Visualization**: Developing dashboards and relational maps to interpret the results.

**Methodological Challenges**

The MWI project utilizes techniques from the sociology of controversies, social network analysis, and text mining methods to:

-   Analyze the strategic positions of speakers in a heterogeneous and complex digital corpus.
-   Identify and understand the dynamics of online discourses.
-   Map the relationships between different actors and their respective influences.

## Case Studies

**Diverse Cases**

1.  **Health Information**

    -   **Asthma and Diabetes in Children**: Studies of online discourses related to these diseases to identify influential actors, understand their positions, and evaluate their impact on patients' perceptions and behaviors. [Source](https://journals.openedition.org/rfsic/8376)

2.  **Online Political Controversy**

    -   **Juan Branco Project**: Analysis of discourses and influence surrounding the public figure Juan Branco, exploring the dynamics of positioning and controversy. [Source](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4584133)

3.  **Research Sociology**

    -   **Digital Humanities**: Studies on the impact of digital technologies on humanities and social sciences, including how researchers use the web to disseminate and discuss their work. [Source](https://hal.science/hal-02485370)

**Results and Impact**

The results of these studies show that online discourses play a crucial role in shaping opinions and behaviors in various fields. They also highlight the importance for researchers and professionals to actively engage in these discussions to promote reliable and scientifically validated information.

## Repositories and Documentation

**NAKALA Repositories**

The data and results of the MWI project are deposited on the NAKALA platform, providing open access for other researchers and practitioners. Here are some important repositories:

1.  [The collection](https://nakala.fr/collection/10.34847/nkl.b4aarv3j): Contains a detailed description of the project, methodology, and results.
2.  [Positions and Influences on the Web: The Case of Health Information](https://nakala.fr/prise-de-position): Detailed analysis of discourses on childhood asthma.
3.  [French Digital Humanities communities](https://nakala.fr/10.34847/nkl.f43by03n): A case study of the development of French digital humanities on the web.

# Development of the R Package

The R package developed within the framework of My Web Intelligence is designed to:

-   Facilitate the replication of analyses conducted in the project.
-   Enable the extension of developed methods and tools for other research.
-   Provide researchers and professionals with a powerful tool to understand and manage the dynamics of online information.

**Main Features**

-   **Project Management**: Tools to initiate and manage web exploration projects.
-   **Data Extraction**: Functions to crawl the web and extract data corpora.
-   **Analysis and Annotation**: Algorithms to analyze and annotate extracted data.
-   **Visualization**: Dashboards and maps to visualize relationships between actors and discourses.

## Conclusion

My Web Intelligence is an integrative project aimed at transforming how we understand and analyze digital information across various fields in social sciences and ICS. By combining innovative methodologies and advanced technological tools, MWI offers new perspectives on digital dynamics and proposes solutions to better understand online interactions and discourses. The R package developed from this project is an essential tool for researchers and practitioners, enabling them to fully exploit web data for in-depth and relevant analyses.

# Using mwiR: A Case Study

## Installation

You can install the development version of mwiR from [GitHub](https://github.com/) with:

```{r}

# install.packages("devtools")
devtools::install_github("MyWebIntelligence/mwiR")

```

## Project ('land') Setup

Load the package before running any of the steps below:

```{r example}
library(mwiR)
## basic example code
```

## Step 1: Creating the Research Project

In this step-by-step guide, we will walk through the initial setup and execution of a research project using the My Web Intelligence (MWI) method. This method allows researchers to analyze the impact of various factors, such as AI on work, by collecting and organizing web data. Here is a breakdown of the R script provided:

### 1. Load the Required Packages

```{r}
initmwi()
```

The `initmwi()` function initializes the My Web Intelligence environment: it loads the required packages and applies the configuration that the later steps depend on.

### 2. Set Up the Database

```{r}
db_setup()
```

The `db_setup()` function sets up the database needed for storing and managing the data collected during the research project. It initializes the necessary database schema and ensures that the database is ready for data insertion and retrieval.

-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 3. Create a Research Project (Land)

```{r}
create_land(name = "AIWork", desc = "Impact of AI on work", lang="en")
```

The `create_land()` function creates a new research project, referred to as a "land" in MWI terminology. This land will serve as the container for all data and analyses related to the project.

-   `name`: A string specifying the name of the land.
-   `desc`: A string providing a description of the land.
-   `lang`: A string specifying the language of the land. Default is `"en"`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 4. Add Search Terms

```{r}
addterm("AIWork", "AI, artificial intelligence, work, employment, job, profession, labor market")
```

The `addterm()` function adds search terms to the project. These terms will be used to crawl and collect relevant web data.

-   `land_name`: A string specifying the name of the land.
-   `terms`: A comma-separated string of terms to add.
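
When the keyword list lives in a character vector (for example, read from a file), it can be collapsed into the comma-separated string that `addterm()` expects. The sketch below uses base R only; the `keywords` vector and the `terms_string` name are illustrative, and the `addterm()` call is left commented out so the chunk runs without a database.

```{r}
# Illustrative keyword vector; may contain stray whitespace or duplicates.
keywords <- c("AI", "artificial intelligence ", "work", "employment",
              "job", "profession", "labor market", "work")

# Trim whitespace, drop duplicates and empty entries, then collapse.
keywords <- unique(trimws(keywords))
keywords <- keywords[nzchar(keywords)]
terms_string <- paste(keywords, collapse = ", ")

print(terms_string)

# Pass the string to addterm() once the land exists:
# addterm("AIWork", terms_string)
```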

### 5. Verify the Project Creation

```{r}
listlands("AIWork")
```

The `listlands()` function lists all lands or projects that have been created. By specifying the project name "AIWork", it verifies that the project has been successfully created.

-   `land_name`: A string specifying the name of the land to list. If `NULL`, all lands are listed. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 6. Add URLs Manually or Using a File

```{r}
addurl("AIWork", urls = "https://www.fr.adp.com/rhinfo/articles/2022/11/la-disparition-de-certains-metiers-est-elle-a-craindre.aspx")
```

The `addurl()` function adds URLs to the project. These URLs point to web pages that contain relevant information for the research.

-   `land_name`: A string specifying the name of the land.
-   `urls`: A comma-separated string of URLs to add. Default is `NULL`.
-   `path`: A string specifying the path to a file containing URLs. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

Alternatively, URLs can be added using a text file:

```{r}
# If using a text file

addurl("AIWork", path = "_ai_or_artificial_intelligence___work_or_employment_or_job_or_profession_or_labor_market01.txt")
```

-   `path`: The path to a text file containing the URLs to be added.
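
A URL list file can be prepared from R itself. The sketch below assumes the file holds one URL per line, which is an assumption about the expected layout rather than something stated above; check the `addurl()` help page before relying on it. The file name and the `example.org` URLs are placeholders.

```{r}
# Placeholder URLs; replace with the pages you want to seed the land with.
seed_urls <- c(
  "https://example.org/ai-and-employment",
  "https://example.org/future-of-work"
)

# Write one URL per line (assumed layout) to a temporary text file.
url_file <- file.path(tempdir(), "aiwork_seed_urls.txt")
writeLines(seed_urls, url_file)

# Read the file back to confirm what will be imported.
print(readLines(url_file))

# Then register the file with the land:
# addurl("AIWork", path = url_file)
```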

### 7. List the Projects or a Specific Project

```{r}
listlands("AIWork")
```

This function is used again to list the projects or a specific project, ensuring that the URLs have been added correctly to "AIWork".

### 8. Optionally Delete a Project

```{r}
deleteland(land_name = "AIWork")

```

The `deleteland()` function deletes a specified project. This can be useful for cleaning up after the research is completed or if a project needs to be restarted.

-   `land_name`: A string specifying the name of the land to delete.
-   `maxrel`: An integer specifying the maximum relevance for expressions to delete. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

This script demonstrates the basic setup and execution of a research project using My Web Intelligence, including project creation, term addition, URL management, and project verification.

## Step 2: Crawling

In this section, we will walk through the process of crawling URLs and extracting content for analysis using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.

### Crawl URLs for a Specific Land

```{r}
crawlurls("AIWork", limit = 10)
```

The `crawlurls()` function crawls URLs for a specific land, updates the database, and calculates relevance scores.

-   `land_name`: A character string representing the name of the land.
-   `urlmax`: An integer specifying the maximum number of URLs to be processed (default is 50).
-   `limit`: An optional integer specifying the limit on the number of URLs to crawl.
-   `http_status`: An optional character string specifying the HTTP status to filter URLs.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

**Example:**

This example demonstrates crawling up to 10 URLs for the land named "AIWork", the land created in Step 1.

```{r}
crawlurls("AIWork", limit = 10)
```
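
To make the idea of a relevance score concrete, the following toy function counts how often a land's terms occur in a page's text. It is only an illustration under simple assumptions (case-insensitive literal matching, every term weighted equally); it is not the scoring that `crawlurls()` applies, and `toy_relevance()` is a hypothetical name.

```{r}
# Toy relevance: total number of term occurrences in a text.
# NOT mwiR's scoring; an assumption-laden illustration only.
toy_relevance <- function(text, terms) {
  text <- tolower(text)
  counts <- vapply(tolower(terms), function(term) {
    hits <- gregexpr(term, text, fixed = TRUE)[[1]]
    if (hits[1] == -1L) 0L else length(hits)
  }, integer(1))
  sum(counts)
}

page  <- "AI is changing work. Some fear for their job; others see new work."
terms <- c("ai", "work", "job")

score <- toy_relevance(page, terms)
print(score)
```

Literal substring matching is deliberately naive: a short term such as "ai" would also match inside longer words, which is one reason a real scorer needs tokenisation or stemming.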

### Crawl Domains

```{r}
crawlDomain(1000)
```

The `crawlDomain()` function crawls domains and updates the Domain table with the fetched data.

-   `nburl`: An integer specifying the number of URLs to be crawled (default is 100).
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.
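
Assuming a domain corresponds to the host part of a crawled URL, the base R sketch below shows how URLs map to domains. `url_host()` is a hypothetical helper written for this illustration; it is not how mwiR itself parses URLs.

```{r}
# Hypothetical helper: extract the host from a URL with base R regexes.
url_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                    # drop path, query, fragment
  host <- sub("^[^@]*@", "", host)                     # drop user info, if any
  host <- sub(":[0-9]+$", "", host)                    # drop the port
  tolower(host)
}

urls <- c(
  "https://www.fr.adp.com/rhinfo/articles/2022/11/la-disparition-de-certains-metiers-est-elle-a-craindre.aspx",
  "http://Example.org:8080/path?q=1"
)

hosts <- url_host(urls)
print(hosts)
```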
...
(The intermediate content of the file has been kept unchanged)
...
## Step 9: Maintain the Database Throughout the Project Lifecycle

The database layer underpins every land. The following helpers keep it healthy and synchronised with external edits.

### 1. Connect Programmatically and Reuse IDs

```{r}
con      <- connect_db()
land_id  <- get_land_id(con, "AIWork")
domaines <- list_domain(con, land_name = "AIWork")
```

-   `connect_db()` returns a ready-to-use `DBI` connection.
-   `get_land_id()` converts human-readable land names into numeric IDs when you automate workflows.
-   `list_domain()` produces a domain summary (counts, keywords) to monitor coverage.
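
Because `connect_db()` returns a `DBI` connection, the standard `DBI` functions apply to it. The sketch below illustrates this on an in-memory SQLite database rather than on `mwi.db`; the `Land` table and its columns are made up for the example and are not the mwiR schema. It requires the `DBI` and `RSQLite` packages.

```{r}
library(DBI)

# In-memory stand-in for mwi.db; the table layout below is hypothetical.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Land", data.frame(
  id   = 1:2,
  name = c("AIWork", "Demo"),
  stringsAsFactors = FALSE
))

# Standard DBI helpers work on any DBI connection.
print(dbListTables(con))

# Parameterised lookup of a land's numeric ID from its name.
land_id <- dbGetQuery(
  con,
  "SELECT id FROM Land WHERE name = ?",
  params = list("AIWork")
)$id
print(land_id)

# Close the connection when the work is done.
dbDisconnect(con)
```

Closing the connection with `dbDisconnect()` at the end of a script avoids leaving the SQLite file locked.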

### 2. Import Additional Material

```{r}
urls <- importFile()
addurl("AIWork", urls = urls$url)
```

Use `importFile()` whenever you enrich your corpus from spreadsheets or open postings. The helper returns a data frame; pass the relevant column to `addurl()`.
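
Imported spreadsheets often contain blanks or duplicate links. The sketch below cleans a URL column before passing it on, using a stand-in data frame in place of the `importFile()` result. Since `urls` is described in Step 1 as a comma-separated string, the vector is also collapsed into one; whether `addurl()` accepts a character vector directly is not confirmed here.

```{r}
# Stand-in for the data frame returned by importFile(); the column name
# `url` follows the example above, the rows are placeholders.
imported <- data.frame(
  url = c("https://example.org/a", " https://example.org/b ", "",
          NA, "https://example.org/a"),
  stringsAsFactors = FALSE
)

# Trim whitespace, drop missing or empty values, keep http(s) links,
# and remove duplicates.
clean <- trimws(imported$url)
clean <- clean[!is.na(clean) & nzchar(clean)]
clean <- unique(clean[grepl("^https?://", clean)])

# Collapse into a single comma-separated string.
url_string <- paste(clean, collapse = ",")
print(url_string)

# addurl("AIWork", urls = url_string)
```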

### 3. Reinstate Externally Annotated Data

```{r}
annotatedData(
  dataplus = curated_notes,
  table    = "Expression",
  champ    = "description",
  by       = "id"
)
```

`annotatedData()` wraps transactional updates so a batch edit either fully succeeds or rolls back. Always back up `mwi.db` before bulk reinsertion.
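
The two ideas above, backing up first and applying a batch atomically, can be sketched with base R and `DBI`. This is an illustration on a throwaway SQLite file, not the implementation of `annotatedData()`; `backup_db()` is a hypothetical helper, and the table reuses the `Expression`, `id`, and `description` names from the call above purely as stand-ins. It requires the `DBI` and `RSQLite` packages.

```{r}
library(DBI)

# Timestamped copy of a database file (base R only).
backup_db <- function(db_path) {
  stamp  <- format(Sys.time(), "%Y%m%d-%H%M%S")
  target <- paste0(db_path, ".", stamp, ".bak")
  if (!file.copy(db_path, target)) stop("Backup failed for ", db_path)
  target
}

# Throwaway SQLite file standing in for mwi.db.
db_path <- tempfile(fileext = ".db")
con <- dbConnect(RSQLite::SQLite(), db_path)
dbWriteTable(con, "Expression",
             data.frame(id = 1:2, description = c("old A", "old B"),
                        stringsAsFactors = FALSE))
dbDisconnect(con)

# Back up while no connection is open, then reconnect.
backup_path <- backup_db(db_path)
con <- dbConnect(RSQLite::SQLite(), db_path)

# Batch of edits; id 99 does not exist, so the whole batch must fail.
edits <- data.frame(id = c(1L, 99L), description = c("new A", "orphan"),
                    stringsAsFactors = FALSE)

result <- tryCatch(
  dbWithTransaction(con, {
    for (i in seq_len(nrow(edits))) {
      changed <- dbExecute(
        con, "UPDATE Expression SET description = ? WHERE id = ?",
        params = list(edits$description[i], edits$id[i])
      )
      # Abort the whole batch if an ID does not match any row.
      if (changed != 1L) stop("No row with id ", edits$id[i])
    }
    "committed"
  }),
  error = function(e) "rolled back"
)

after <- dbReadTable(con, "Expression")
print(result)
print(after)
dbDisconnect(con)
```

The first update succeeds inside the transaction, but the failed second one triggers a rollback, so the table keeps its original descriptions.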

### 4. Export Precisely What You Need

Beyond `export_land()`, the family of dedicated exporters gives you fine-grained control:

-   `export_pagecsv()` and `export_fullpagecsv()` to share tabular corpora;
-   `export_nodecsv()` / `export_nodegexf()` for network analysis;
-   `export_mediacsv()` to audit associated media;
-   `export_pagegexf()` for expression-level graphs;
-   `export_corpus()` to assemble text files plus metadata headers (ideal for CAQDAS tools).

Each exporter accepts `minimum_relevance`, so you can balance breadth and focus depending on the audience.
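
One way to choose a threshold is to count how many expressions survive each candidate value before exporting. The sketch below does this on a made-up data frame; the `relevance` column name and the `>=` comparison are assumptions about how `minimum_relevance` filters, so treat the result as a rough guide.

```{r}
# Made-up expressions with relevance scores (placeholder data).
pages <- data.frame(
  id        = 1:6,
  relevance = c(0, 1, 3, 3, 7, 12)
)

# Count how many rows remain for each candidate threshold.
thresholds <- c(1, 3, 5, 10)
kept <- vapply(thresholds,
               function(th) sum(pages$relevance >= th),
               integer(1))

summary_table <- data.frame(minimum_relevance = thresholds, rows_kept = kept)
print(summary_table)
```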
