Document-Processing/PDF/PDF-Library/NET/Working-with-OCR/Features.md
235 additions & 4 deletions
Original file line number
Diff line number
Diff line change
@@ -152,7 +152,7 @@ You can download a complete working sample from [GitHub](https://github.com/Sy
152
152
153
153
## Performing OCR with tesseract version 3.05
154
154
155
-
The [TesseractVersion](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_TesseractVersion) property is used to switch the tesseract version between 3.02 and 3.05. By default, OCR works with tesseract version 3.02.
155
+
The [TesseractVersion](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_TesseractVersion) property switches the Tesseract version between 3.02 and 3.05. By default, OCR uses Tesseract version 5.0.
156
156
157
157
N> The earliest Tesseract version supported in ASP.NET Core is 4.0, so versions 3.02 and 3.05 are not supported and the ``TesseractVersion`` property is not available on the ASP.NET Core platform.
158
158
@@ -216,9 +216,7 @@ End Using
216
216
217
217
## Performing OCR with Tesseract Version 4.0
218
218
219
-
The [TesseractVersion](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_TesseractVersion) property is used to switch the tesseract version to 4.0. By default, OCR will be performed with tesseract version 3.02.
220
-
221
-
N> In ASP.NET Core platform, the default and starting supported version of tesseract is 4.0. So we did not have the property called ``TesseractVersion`` in ASP.NET Core platform.
219
+
The [TesseractVersion](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_TesseractVersion) property switches the Tesseract version to 4.0. By default, OCR uses Tesseract version 5.0.
222
220
223
221
The following code sample shows how to perform OCR on a PDF document with Tesseract version 4.0.
224
222
@@ -277,6 +275,67 @@ End Using
277
275
278
276
{% endtabs %}
279
277
278
+
## Performing OCR with Tesseract Version 5.0
279
+
280
+
The [TesseractVersion](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_TesseractVersion) property selects Tesseract version 5.0. This is also the default version, so OCR uses version 5.0 when no other version is set.
281
+
282
+
The following code sample shows how to perform OCR on a PDF document with Tesseract version 5.0.
283
+
284
+
{% tabs %}
285
+
286
+
{% highlight c# tabtitle="C# [Cross-platform]" %}
287
+
288
+
using Syncfusion.OCRProcessor;
289
+
using Syncfusion.Pdf.Parsing;
290
+
291
+
//Initialize the OCR processor.
292
+
using (OCRProcessor processor = new OCRProcessor())
293
+
{
294
+
//Load an existing PDF document.
295
+
PdfLoadedDocument document = new PdfLoadedDocument("Input.pdf");
'Perform OCR on the input document.
327
+
processor.PerformOCR(document)
328
+
329
+
'Save the PDF document.
330
+
document.Save("Output.pdf")
331
+
'Close the document.
332
+
document.Close(True)
333
+
End Using
334
+
335
+
{% endhighlight %}
336
+
337
+
{% endtabs %}
338
+
280
339
## Performing OCR on image
281
340
282
341
The below code example illustrates how to perform OCR on image file using [PerformOCR](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRProcessor.html#Syncfusion_OCRProcessor_OCRProcessor_PerformOCR_System_Drawing_Bitmap_System_String_) method in [OCRProcessor](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRProcessor.html) class.
@@ -1005,6 +1064,178 @@ End Using
1005
1064
1006
1065
N> The OCR Engine Mode is supported only in the Tesseract version 4.0 and above.
1007
1066
1067
+
## Performing OCR with different OCR Image Enhancement Modes
1068
+
1069
+
The `ImageEnhancementMode` property sets the OCR image enhancement mode. By default, OCR uses the `EnhanceForRecognitionOnly` mode. The code example after the table below shows how to set the image enhancement mode before performing OCR.
1070
+
1071
+
The following table describes the available OCR image enhancement modes and their respective purposes.
1072
+
1073
+
<table>
1074
+
<thead>
1075
+
<tr>
1076
+
<th>
1077
+
OCR Image Enhancement Mode<br/><br/></th><th>
1078
+
Description<br/><br/></th></tr>
1079
+
</thead>
1080
+
<tbody>
1081
+
<tr>
1082
+
<td>
1083
+
EnhanceForRecognitionOnly<br/><br/></td><td>
1084
+
Image is enhanced internally to improve OCR accuracy, but the original image is retained in the output.<br/><br/></td></tr>
1085
+
<tr>
1086
+
<td>
1087
+
EnhanceAndIncludeInOutput<br/><br/></td><td>
1088
+
Image is enhanced and the enhanced version is used in the output document.<br/><br/></td></tr>
1089
+
<tr>
1090
+
<td>
1091
+
None<br/><br/></td><td>
1092
+
No image enhancement is performed. The original image is used for OCR processing.<br/><br/></td></tr>
1093
+
</tbody>
1094
+
</table>
1095
+
1096
+
{% tabs %}
1097
+
1098
+
{% highlight c# tabtitle="C# [Cross-platform]" %}
1099
+
1100
+
using Syncfusion.OCRProcessor;
1101
+
using Syncfusion.Pdf.Parsing;
1102
+
1103
+
// Initialize the OCR processor
1104
+
using (OCRProcessor processor = new OCRProcessor())
1105
+
{
1106
+
// Load an existing PDF document
1107
+
PdfLoadedDocument document = new PdfLoadedDocument("Input.pdf");
1108
+
// Set the OCR language to English for text recognition.
1109
+
processor.Settings.Language = Languages.English;
1110
+
// Set the OCR image enhancement mode to improve recognition accuracy.
' Perform OCR with input document and tessdata (Language packs)
1137
+
processor.PerformOCR(document)
1138
+
'Save the processed PDF document
1139
+
document.Save("Output.pdf")
1140
+
' Close the document
1141
+
document.Close(True)
1142
+
1143
+
End Using
1144
+
1145
+
{% endhighlight %}
1146
+
{% endtabs %}
1147
+
1148
+
## Performing OCR with different OCR Image Enhancement options
1149
+
1150
+
The `ImageEnhancementMode` property sets the OCR image enhancement mode. The individual enhancement steps are selected through an `OcrImageEnhancementOptions` instance, as the code example after the table below shows.
1151
+
1152
+
The following table describes the available OCR image enhancement options and their respective purposes.
1153
+
1154
+
<table>
1155
+
<thead>
1156
+
<tr>
1157
+
<th>
1158
+
OCR Image Enhancement options<br/><br/></th><th>
1159
+
Description<br/><br/></th></tr>
1160
+
</thead>
1161
+
<tbody>
1162
+
<tr>
1163
+
<td>
1164
+
IsGrayscaleEnabled<br/><br/></td><td>
1165
+
Simplifies image data by removing color information, making text easier to detect.<br/><br/></td></tr>
1166
+
<tr>
1167
+
<td>
1168
+
IsDeskewEnabled<br/><br/></td><td>
1169
+
Corrects tilted or rotated text for proper alignment.<br/><br/></td></tr>
1170
+
<tr>
1171
+
<td>
1172
+
IsDenoiseEnabled<br/><br/></td><td>
1173
+
Removes speckles and artifacts that can interfere with character recognition.<br/><br/></td></tr>
1174
+
<tr>
1175
+
<td>
1176
+
IsConstrastEnabled<br/><br/></td><td>
1177
+
Enhances text visibility against the background.<br/><br/></td></tr>
1178
+
<tr>
1179
+
<td>
1180
+
IsBinarizeEnabled<br/><br/></td><td>
1181
+
Converts images to black-and-white for sharper text edges, using advanced thresholding methods.<br/><br/></td></tr>
1182
+
</tbody>
1183
+
</table>
1184
+
1185
+
{% tabs %}
1186
+
1187
+
{% highlight c# tabtitle="C# [Cross-platform]" %}
1188
+
1189
+
using Syncfusion.OCRProcessor;
1190
+
using Syncfusion.Pdf.Parsing;
1191
+
1192
+
// Initialize the OCR processor
1193
+
using (OCRProcessor processor = new OCRProcessor())
1194
+
{
1195
+
// Load an existing PDF document
1196
+
PdfLoadedDocument document = new PdfLoadedDocument("Input.pdf");
1197
+
// Set the OCR language to English for text recognition.
1198
+
processor.Settings.Language = Languages.English;
1199
+
// Set the options for image enhancement during the OCR process.
1200
+
OcrImageEnhancementOptions options = new OcrImageEnhancementOptions();
1201
+
// Enable grayscale conversion to improve OCR accuracy by reducing color noise.
1202
+
options.IsGrayscaleEnabled = true;
1203
+
// Perform OCR with input document and tessdata (Language packs)
'Initialize the OCR processor inside a Using block to ensure proper disposal.
1219
+
Using processor As New OCRProcessor()
1220
+
' Load an existing PDF document.
1221
+
Dim document As New PdfLoadedDocument("Input.pdf")
1222
+
'Set the OCR language to English for text recognition.
1223
+
processor.Settings.Language = Languages.English
1224
+
' Set the options for image enhancement during the OCR process.
1225
+
Dim options As New OcrImageEnhancementOptions()
1226
+
'Enable grayscale conversion to improve OCR accuracy by reducing color noise.
1227
+
options.IsGrayscaleEnabled = True
1228
+
' Perform OCR on the input document using tessdata (language packs).
1229
+
processor.PerformOCR(document)
1230
+
'Save the processed PDF document.
1231
+
document.Save("Output.pdf")
1232
+
' Close the document and release resources.
1233
+
document.Close(True)
1234
+
End Using
1235
+
1236
+
{% endhighlight %}
1237
+
{% endtabs %}
1238
+
1008
1239
## White List
1009
1240
1010
1241
The [WhiteList](https://help.syncfusion.com/cr/document-processing/Syncfusion.OCRProcessor.OCRSettings.html#Syncfusion_OCRProcessor_OCRSettings_WhiteList) property specifies the only characters that the OCR engine is allowed to recognize. A character that is not on the white list is not included in the OCR results. For more information, refer to the following code sample.
Document-Processing/PDF/PDF-Library/NET/Working-with-OCR/Working-with-OCR.md
15 additions & 12 deletions
Original file line number
Diff line number
Diff line change
@@ -11,7 +11,15 @@ keywords: Assemblies
11
11
12
12
Optical character recognition (OCR) is a technology used to convert scanned paper documents in the form of PDF files or images into searchable and editable data.
13
13
14
-
The [Syncfusion<sup>®</sup> OCR processor library](https://www.syncfusion.com/document-processing/pdf-framework/net/pdf-library/ocr-process) has extended support to process OCR on scanned PDF documents and images with the help of Google’s [Tesseract](https://github.com/tesseract-ocr/tesseract) Optical Character Recognition engine.
14
+
The [Syncfusion<sup>®</sup> OCR processor library](https://www.syncfusion.com/document-processing/pdf-framework/net/pdf-library/ocr-process) supports OCR on scanned PDF documents and images using Google’s [Tesseract](https://github.com/tesseract-ocr/tesseract) Optical Character Recognition engine.
15
+
16
+
A built-in image preprocessor prepares images for recognition before OCR runs. This step produces cleaner input and reduces OCR errors. The preprocessor supports the following enhancements:
17
+
18
+
* **Convert to Grayscale** – Simplifies image data by removing color information, making text easier to detect.
19
+
* **Deskew** – Corrects tilted or rotated text for proper alignment.
20
+
* **Denoise** – Removes speckles and artifacts that can interfere with character recognition.
21
+
* **Apply Contrast Adjustment** – Enhances text visibility against the background.
22
+
* **Apply Binarize** – Converts images to black-and-white for sharper text edges, using advanced thresholding methods.
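
To make two of the enhancements listed above concrete, here is a minimal sketch of grayscale conversion and fixed-threshold binarization on a raw pixel buffer. It is an illustration only and not the Syncfusion implementation: the `PreprocessSketch` class, its method names, the interleaved RGB layout, and the fixed threshold are all assumptions made for this example, and a production preprocessor would more likely use an adaptive thresholding method.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Syncfusion API.
public static class PreprocessSketch
{
    // Converts interleaved RGB bytes (3 bytes per pixel) to one luminance
    // byte per pixel using the common ITU-R BT.601 weights.
    public static byte[] ToGrayscale(byte[] rgb)
    {
        if (rgb.Length % 3 != 0)
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));

        byte[] gray = new byte[rgb.Length / 3];
        for (int i = 0; i < gray.Length; i++)
        {
            double luminance = 0.299 * rgb[3 * i]
                             + 0.587 * rgb[3 * i + 1]
                             + 0.114 * rgb[3 * i + 2];
            gray[i] = (byte)Math.Round(luminance);
        }
        return gray;
    }

    // Maps each grayscale value to pure black (0) or pure white (255)
    // by comparing it against a fixed threshold.
    public static byte[] Binarize(byte[] gray, byte threshold)
    {
        byte[] result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = gray[i] >= threshold ? (byte)255 : (byte)0;
        }
        return result;
    }
}
```

Calling `PreprocessSketch.Binarize(PreprocessSketch.ToGrayscale(pixels), 128)` on a buffer of RGB bytes returns one byte per pixel that is either 0 or 255, which is the kind of high-contrast input an OCR engine handles best.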
15
23
16
24
The Syncfusion<sup>®</sup> OCR processor library works seamlessly on various platforms: Azure App Services, Azure Functions, AWS Textract, Docker, WinForms, WPF, Blazor, ASP.NET MVC, and ASP.NET Core with Windows, macOS, and Linux.