Posts

Using PyArrow: Quickstart Guide

This quickstart walks you through setting up a basic PyArrow application for efficient data processing and interoperability. It requires Python 3.9+ and PyArrow 15.0.0 (2025). By the end, you’ll have a working script that reads, writes, and manipulates data using PyArrow’s core functionality. From there, you can explore advanced features such as data serialization, integration with the wider Apache Arrow ecosystem, and performance optimization.

Mastering Browser Automation with Playwright

Playwright 1.35 provides a robust framework for end-to-end browser automation, enabling efficient testing and scraping across modern web applications. Getting Started with Playwright Installation and Setup: Playwright is a powerful tool for browser automation, supporting multiple programming languages including JavaScript/TypeScript, Python, Java, and C#. This guide covers the prerequisites, installation methods, and initial project setup, with emphasis on using the latest stable version and production-ready configurations. Prerequisites: Before installing Playwright, ensure your system meets the following requirements. Node.js: latest 20.x, 22.x, or 24.x (for JavaScript/TypeScript projects). Python 3: Python 3.8+ is recommended (for Python projects). Java Development Kit (JDK): JDK 17+ is recommended (for Java projects). .NET SDK: .NET 6.0 or later (for C# projects). Operating System: Windows 11+, macOS 14 (Sonoma) or later, or Linux distributions such as Debi...

Extracting Text from PDF in Python with PDFMiner

PDFMiner.six version 20251107 enables robust text extraction from PDF 1.7 documents with enhanced performance and security. It supports Python 3.9+, and the optional pdfminer.six[image] extra adds image extraction for processing complex PDFs. Advanced features such as font and layout preservation allow accurate semantic and visual reconstruction of text. For basic needs, use pdfminer.six’s high-level extract_text function; for fine-grained control, consider a separate library such as PyMuPDF. For OCR-enhanced workflows on scanned documents, consider PaddleOCR’s PP-OCRv5 (reported 40% accuracy improvement) or Mistral OCR 2503 (reported 94.89% benchmark accuracy).

Python for Data Engineering: Polars vs Pandas Performance Comparison

Polars and Pandas are both used for data manipulation in Python, with Polars emerging as a high-performance alternative in 2025. This comparison evaluates their architectural differences, execution speed, feature sets, and suitability for various data engineering workflows. Key distinctions include memory efficiency, parallel processing capabilities, and API design. The analysis covers Polars v0.17.0 and Pandas v2.2.0, focusing on technical trade-offs rather than subjective preference. Please see the post for details: https://dasroot.net/posts/2025/12/python-data-engineering-polars-vs-pandas-performance/ Conclusion: Polars and Pandas both serve data engineering workflows but differ in performance and architecture. Polars 0.20.5 outperforms Pandas 2.2.2 by 3–10x on large datasets (1M+ rows) due to its lazy evaluation model and Rust backend, reducing memory usage and improving scalability. Pandas, however, offers deeper integration with ML libraries like Scikit-learn and visualizatio...